Showing posts with label Artificial intelligence. Show all posts

Wednesday, September 18, 2024

Generative AI with Kafka & Spark

Target audience: Advanced
Estimated reading time: 6'

While Python is predominantly recognized as the go-to programming language for data science, leveraging Java-based frameworks can offer substantial advantages, especially for rapid distributed inference.

In this article, we will outline the structure of a swift, distributed inference system for Bidirectional Encoder Representations from Transformers (BERT) models [ref 1], harnessing the capabilities of Apache Spark, Kafka, and the Deep Java Library (DJL).


Table of contents
     Architecture

What you will learn: How to leverage Apache Kafka, Spark & Deep Java Library for faster inference on transformer models.

Notes:
  • This article doesn't delve into the specifics of Apache Spark, Kafka, Deep Java Library, or BERT individually. Instead, it focuses on how these components are integrated to create an efficient solution for inference tasks.
  • Development environments: JDK 11, Scala 2.12.15, Apache Spark 3.3.1, Apache Kafka 2.8.0, Deep Java Library 0.20.0
  • Comments and ancillary code are omitted for the sake of clarity.
  • Source code available at https://github.com/patnicolas/bertspark

Combining the best of both worlds

Python is a popular environment for developing and training deep learning models with frameworks like TensorFlow, PyTorch, and MXNet. As an interpreted language, it offers data scientists the flexibility of notebooks for interactive development, evaluation, and refinement of neural models. Python boasts an extensive ecosystem of libraries encompassing natural language processing, machine learning models, statistical algorithms, and data management tools.

However, this dynamic development environment faces two significant challenges in runtime inference:
  1. Python's limited capacity for task parallelization, whether through concurrent threads or distributing tasks across a network.
  2. Commercial applications often depend on web services running on the Java Virtual Machine (JVM) and make extensive use of Apache's open-source libraries.
This raises the question: Can we use Python to define, train, and evaluate deep learning models, and then employ JVM-based languages for real-time inference?

The solution hinges on the fact that deep learning frameworks like PyTorch and TensorFlow are fundamentally native libraries written in C++. These native binaries can be accessed by both Python and Java through their respective interfaces.

Java/Scala for inference

As mentioned earlier, Python frameworks are frequently employed for training deep learning models. However, when it comes to deployment into production (specifically for inference purposes), these models need to be merged with current applications that are written in Java or Scala. This integration often involves utilizing data processing frameworks like Flink, Presto, or Spark.

This particular study concentrates on the integration of a BERT model into an existing Spark application. Apache Spark is renowned for its ability to rapidly process large datasets concurrently across multiple distributed services, as noted [ref 1]. Meanwhile, the Deep Java Library serves as a Java library that implements the most popular deep learning models, providing access via a Java native interface.

Training (Python) and Inference (Java/Scala) stack


Apache Spark and Amazon's Deep Java Library (DJL) tackle the two main challenges of deploying, in production, machine learning models that were developed in Python.

Typically, the process involves creating models in a Python environment like Jupyter, an IDE, or Anaconda, and then saving the model parameters. DJL then takes over by loading these saved parameters and initializing the inference model, which is then ready to handle runtime requests.

Distributed inference pipeline

The goal is to utilize Apache Spark for distributed computation and Kafka for asynchronous, or non-blocking, data queuing.
By integrating these two technologies, we can enhance the scalability of predictions by parallelizing the execution of deep learning models. The critical components of this distributed inference pipeline include:
  • Apache Spark: This tool segments runtime requests for predictions into batches. These batches are then processed concurrently across remote worker nodes.
  • Apache Kafka: This acts as an asynchronous messaging queue, effectively separating the client application from the inference pipeline, ensuring smooth data flow without bottlenecks.
  • Deep Java Library (DJL): It connects with the binary executables of the deep learning models.
  • Kubernetes: This system orchestrates the containerized instances of the inference pipelines, facilitating scalable and automated deployment. Notably, Spark version 3.2 and later versions offer direct integration with Kubernetes.
  • Deep Learning Frameworks: This includes well-known frameworks like TensorFlow, MXNet, and PyTorch, which are part of the overall architecture.
Through this combination, we achieve a robust and scalable system for managing and executing deep learning model inferences efficiently.

Generic data flow for Inference of deep learning models with DJL

The two main benefits of such a pipeline are simplicity (all tasks/processes run on the JVM) and low latency.

Note: Spark and DJL can also be used in the training phase to distribute the training of a mini batch.

Apache Kafka

Apache Kafka is an open-source distributed event streaming platform for high-volume data pipelines, streaming analytics, data integration, and mission-critical applications. Event streaming ensures a continuous flow of data through a pipeline or sequence of transformations such as Extract, Transform and Load [ref 2].

First, we construct the handler class, KafkaPrediction, that

  1. consumes requests from Kafka topic consumeTopic
  2. invokes the prediction model and transformation, predictionPipeline
  3. produces prediction into Kafka topic produceTopic
The actual request is wrapped into the consumed message, RequestMessage. Same for the prediction produced back to the Kafka queue.

class KafkaPrediction(
  consumeTopic: String,
  produceTopic: String,
  predictionPipeline: Seq[Request] => Seq[Prediction]) {

  // 1 - Construct the transform of Kafka messages for prediction
  val transform = (requestMsg: Seq[RequestMessage]) => {
    // 2 - Invoke the execution of the pipeline
    val predictions = predictionPipeline(requestMsg.map(_.requestPayload))
    predictions.map(ResponseMessage(_))
  }

  // 3 - Build the Kafka consumer for prediction requests
  val consumer = new KafkaConsumer[RequestMessage](
    RequestSerDe.deserializingClass,
    consumeTopic
  )

  // 4 - Build the Kafka producer for prediction responses
  val producer = new KafkaProducer[ResponseMessage](
    ResponseSerDe.serializingClass,
    produceTopic
  )
  .....
}

  1. We first need to create a wrapper function, transform, to generate predictions. The function converts request messages of type RequestMessage into predictions of type ResponseMessage.
  2. The wrapper, transform, invokes the prediction pipeline, predictionPipeline, after converting the messages of type RequestMessage consumed from Kafka into actual requests (Request). The predictions are converted into messages of type ResponseMessage to be produced to Kafka.
  3. The consumer is fully defined by the de-serialization of data consumed from Kafka and its associated topic.
  4. The producer serializes the responses back to the Kafka service.
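
The article does not list the message types themselves, so here is a minimal sketch of how the envelopes and the transform could fit together. The names Request, Prediction, RequestMessage, ResponseMessage, requestPayload and payload follow the snippets in this post; the fields (id, text, topic), the helper buildTransform and the stand-in dummyPipeline are assumptions made for illustration only.

```scala
object KafkaMessageSketch {
  // Hypothetical payload types: the fields are assumptions for illustration
  final case class Request(id: String, text: String)
  final case class Prediction(id: String, topic: String)

  // Envelopes exchanged through Kafka, mirroring the names used in the article
  final case class RequestMessage(requestPayload: Request)
  final case class ResponseMessage(payload: Prediction)

  // Builds the transform: unwrap the requests, run the pipeline, wrap the predictions
  def buildTransform(
    predictionPipeline: Seq[Request] => Seq[Prediction]
  ): Seq[RequestMessage] => Seq[ResponseMessage] =
    (requestMsgs: Seq[RequestMessage]) =>
      predictionPipeline(requestMsgs.map(_.requestPayload)).map(ResponseMessage(_))

  // Stand-in for the model (assumption): labels every request with a fixed topic
  val dummyPipeline: Seq[Request] => Seq[Prediction] =
    _.map(req => Prediction(req.id, "cardiology"))
}
```

Keeping the transform a pure function of the pipeline makes it possible to test the wrapping and unwrapping logic without a running Kafka broker.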
def executeBatch(
  consumeTopic: String,
  produceTopic: String,
  maxNumResponses: Int): Unit = {

  // 1 - Initialize the prediction pipeline
  val kafkaHandler = new KafkaPrediction(
    consumeTopic,
    produceTopic,
    predictionPipeline
  )

  while(running) {
    // 2 - Poll the request topic (has its own specific Kafka exception handler)
    val consumerRecords = kafkaHandler.consumer.receive

    if(consumerRecords.nonEmpty) {
      // 3 - Generate and apply transform to the batch
      val input: Seq[RequestMessage] = consumerRecords.map(_._2)
      val responses = kafkaHandler.predict(input)

      if(responses.nonEmpty) {
        // 4 - Key each response message by the identifier of its payload
        val respMessages = responses.map(
          response => (response.payload.id, response)
        )

        // 5 - Produce the batch of response messages to Kafka
        kafkaHandler.producer.send(respMessages)

        // 6 - Commit asynchronously once the responses have been produced
        kafkaHandler.consumer.asyncCommit
      }
      else
        logger.error("No response is produced to Kafka")
    }
  }
  // Release the Kafka resources once the polling loop exits
  kafkaHandler.close
}
  1. First, we instantiate the Kafka message handler class, KafkaPrediction, we created earlier.
  2. At regular intervals, we poll Kafka for a batch of new requests.
  3. If the batch is not empty, we invoke the handler method, predict, which applies the prediction models.
  4. Once done, we encapsulate the predictions into ResponseMessage instances.
  5. The messages are produced into the producer topic in the Kafka queue.
  6. Finally, the handler commits asynchronously, so Kafka records that the batch has been processed.
Next, we leverage Spark to distribute the batch of requests across multiple computation nodes (workers).


Apache Spark

Apache Spark, an open-source distributed processing system, is adept at handling large-scale data sets. It leverages in-memory caching and refined query execution strategies for real-time analytics [ref 3].

In our specific use case, we employ Spark to distribute a batch of requests, which are sourced from Kafka, across a network. This setup enables the simultaneous execution of multiple BERT models. Such an architecture not only prevents a single point of failure, ensuring fault tolerance, but also permits the use of generic, cost-efficient hardware.

Leveraging Spark data sets and partitioning is surprisingly simple.

def predict(
  requests: Seq[Request]
)(implicit sparkSession: SparkSession): Seq[Prediction] = {
  import sparkSession.implicits._

  // 1 - Convert the requests into a Spark data set
  val requestDataset = requests.toDS()

  // 2 - Execute the prediction by invoking the DJL model on each request
  val responseDataset: Dataset[Prediction] = requestDataset.map(predict(_))

  // 3 - Collect the predictions from the worker nodes
  responseDataset.collect().toSeq
}
  1. Once the Spark session (context) is initiated, the batch of requests is converted into a data set, requestDataset.
  2. Spark applies the prediction model (DJL) to each request of the partitioned data.
  3. Finally, the predictions are collected from the Spark worker nodes before being returned to the Kafka handler.

Note: The Spark context is assumed to be created and passed as an implicit parameter to the prediction method.
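
To make the partition, predict and collect pattern concrete without a Spark cluster, the sketch below uses plain Scala futures as a stand-in for Spark workers. It is not Spark code: the Request and Prediction fields, the classify function and the numPartitions parameter are assumptions for illustration, and a real deployment would rely on Spark's own partitioning and scheduling.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PartitionedInferenceSketch {
  // Hypothetical request and prediction types
  final case class Request(id: Int, text: String)
  final case class Prediction(id: Int, label: String)

  // Stand-in for the DJL classifier (assumption): labels a request by text length
  def classify(request: Request): Prediction =
    Prediction(request.id, if (request.text.length > 10) "long" else "short")

  // Splits the batch into partitions, scores each partition concurrently,
  // then gathers the results in their original order (the analogue of collect())
  def predict(requests: Seq[Request], numPartitions: Int): Seq[Prediction] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val partitionSize =
      math.max(1, math.ceil(requests.size.toDouble / numPartitions).toInt)
    val partitions = requests.grouped(partitionSize).toSeq
    val futures = partitions.map(partition => Future(partition.map(classify)))
    Await.result(Future.sequence(futures), 30.seconds).flatten
  }
}
```

As with the Spark version, the caller only sees a sequence of requests going in and a sequence of predictions coming out; the partitioning stays an internal detail.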

 

Deep Java Library

This component is crucial as it connects the flow of incoming and outgoing data with the deep learning models. The Deep Java Library (DJL) is an open-source Java framework that accommodates popular deep learning frameworks like MXNet, PyTorch, and TensorFlow.

DJL's capability to adapt to any hardware setup (be it CPU or GPU) and its integration with big data frameworks position it as an ideal choice for a high-performance distributed inference engine [ref 4]. The library is particularly well-suited for constructing transformer encoders like BERT, as well as decoders such as GPT.

In this setup, the input tensors are processed by the deep learning models on a GPU. Importantly, the data is allocated in the native memory space, which is external to the JVM and its garbage collector. The DJL library supports native tensor types such as NDArray and lists of tensors like NDList, along with a straightforward memory management tool, NDManager.

The classifier operates on the Spark worker node. The following code snippet, though a simplified version, illustrates the steps involved in invoking a BERT-based classifier using the DJL framework. 

class BERTClassifier(
  minTermFrequency: Int,
  path: Path)(implicit sparkSession: SparkSession) {

  // 1 - Manage tensor allocation as NDArray
  val ndManager = NDManager.newBaseManager()

  // 2 - Define the configuration of the classifier
  val classifyCriteria: Criteria[NDList, NDList] = Criteria.builder()
    .optApplication(Application.UNDEFINED)
    .setTypes(classOf[NDList], classOf[NDList])
    .optOptions(options)
    .optModelUrls(s"file://${path.toAbsolutePath}")
    .optBlock(classificationBlock)
    .optEngine(Engine.getDefaultEngineName())
    .optProgress(new ProgressBar())
    .build()

  // 3 - Load the model from a local file
  val thisModel = classifyCriteria.loadModel()

  // 4 - Instantiate a new predictor
  val predictor = thisModel.newPredictor()

  // 5 - Execute this request on this worker node
  def predict(requests: Request): Prediction = {
    predictor.predict(ndManager, requests)
  }

  // 6 - Close resources (predictor first, then the model that created it)
  def close(): Unit = {
    predictor.close()
    thisModel.close()
    ndManager.close()
  }
}
  1. Set the manager for tensors in native memory.
  2. Configure the classifier with its related neural block (classificationBlock).
  3. Load the model (MXNet, PyTorch or TensorFlow) from a local file.
  4. Instantiate a predictor from the model.
  5. Submit the request to the DL model and return a prediction.
  6. Close all the resources allocated in the native memory at the end of the run.
Note: DJL can optionally be used for training.


Use case: BERT

In order to illustrate the application of Spark and DJL to BERT we consider a model to predict a topic given a document. 

Architecture

Our model has 3 components:
  • Text processor (Tokenizer, Document segmentation,...)
  • Pre-trained BERT
  • Fully-connected neural network classifier (supervised)

A transformer model consists of two main components: an encoder and a decoder. The encoder's role is to convert sentences and paragraphs into an internal format, typically a numerical matrix, that captures the context of the input. Conversely, the decoder interprets and reverses this process. When combined, the encoder and decoder enable the transformer to execute sequence-to-sequence tasks like translation. Interestingly, isolating the encoder part of the transformer provides insights into the context, enabling various intriguing applications.

BERT particularly capitalizes on the attention mechanism to gain a more nuanced understanding of language context. BERT is composed of several layers of encoder blocks. In this model, the input text is divided into tokens, akin to the traditional transformer model, and each token is subsequently converted into a vector at BERT's output.

BERT has been applied to various problems, including the automation of medical coding [ref 5].

Neural blocks

The practice of arranging components of neural networks, such as layers and activation functions, into modular, reusable blocks is a common strategy to simplify and deconstruct complex models [ref 6].

In DJL, a block is a composable function that forms a neural network. It can represent a single operation, parts of a neural network, and even the whole neural network. What makes blocks special is that they contain a number of parameters that are used in their function and are trained during deep learning. As these parameters are trained, the functions represented by the blocks get more and more accurate.

The core purpose of a block is to perform an operation on the inputs, and return an output. It is defined in the forward method. The forward function could be defined explicitly in terms of parameters or implicitly and could be a combination of the functions of the child blocks. 
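
The idea of a parent block whose forward function is the combination of its child blocks can be sketched in a few lines of plain Scala. This is a deliberately simplified, hypothetical model and not the DJL Block interface: the trait Block and the classes Affine, Relu and Sequential below are assumptions for illustration, operating on vectors of doubles instead of NDList.

```scala
object BlockCompositionSketch {
  // Hypothetical stand-in for a neural block: this is not the DJL Block interface
  trait Block {
    def forward(input: Vector[Double]): Vector[Double]
  }

  // Leaf block with (trainable) parameters: y = weight * x + bias, element-wise
  final class Affine(weight: Double, bias: Double) extends Block {
    def forward(input: Vector[Double]): Vector[Double] =
      input.map(x => weight * x + bias)
  }

  // Leaf block without parameters: rectified linear unit
  object Relu extends Block {
    def forward(input: Vector[Double]): Vector[Double] =
      input.map(x => math.max(0.0, x))
  }

  // Parent block whose forward pass is the composition of its children, in order
  final class Sequential(children: Block*) extends Block {
    def forward(input: Vector[Double]): Vector[Double] =
      children.foldLeft(input)((activation, child) => child.forward(activation))
  }
}
```

For instance, new Sequential(new Affine(2.0, -1.0), Relu) applies the affine transform first and the activation second, and can itself be nested as a child of a larger block.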

The following code snippet illustrates the composition of blocks for a transformer encoder using Deep Java Library blocks. The 3 main components are:
  • Transformer, self-attention block with token, position and sentence order embeddings
  • Masked Language Model (MLM) block
  • Next Sentence Prediction (NSP) block
class CustomPretrainingBlock (
  bertModelType: String,
  activationType: String,
  vocabularySize: Long) extends BaseNetBlock {

  // First block: BERT transformer
  val bertBlock = getBertConfig(bertModelType)
    .setTokenDictionarySize(Math.toIntExact(vocabularySize))
    .build
  val activationFunc: java.util.function.Function[NDArray, NDArray] =
    ActivationConfig.getNDActivationFunc(activationType)

  // Second block: Masked Language Model
  val bertMLMBlock = new BertMaskedLanguageModelBlock(bertBlock, activationFunc)

  // Third block: Next Sentence Predictor
  val bertNSPBlock = new BertNextSentenceBlock
  val pretrainingBlocks = new BERTPretrainingBlocks(
    ("transformer", bertBlock),
    ("mlm", bertMLMBlock),
    ("nsp", bertNSPBlock)
  )

  // The body of forwardInternal is listed in the appendix
  override protected def forwardInternal(
    parameterStore: ParameterStore,
    inputNDList: NDList,
    training: Boolean,
    params: PairList[String, java.lang.Object]): NDList = ???
}

BERT comes in several configurations that differ in the number of encoder blocks, the number of attention heads, the embedding size and the dimensions.

def getBertConfig(bertModelType: String): BertBlock.Builder = bertModelType match {
  case `nanoBertLbl` =>
    // 4 encoders, 4 attention heads, embedding size: 256, dimension 256x4
    BertBlock.builder().nano()

  case `microBertLbl` =>
    // 12 encoders, 8 attention heads, embedding size: 512, dimension 512x4
    BertBlock.builder().micro()

  case `baseBertLbl` =>
    // 12 encoders, 12 attention heads, embedding size: 768, dimension 768x4
    BertBlock.builder().base()

  case `largeBertLbl` =>
    // 24 encoders, 16 attention heads, embedding size: 1024, dimension 1024x4
    BertBlock.builder().large()

  case _ =>
    // Fail fast on an unknown label instead of returning Unit
    throw new IllegalArgumentException(s"Unsupported BERT model type: $bertModelType")
}

The appendix provides a detailed implementation guide for executing the 'forward' method used in pre-training, written in Scala, for reference purposes.



Thank you for reading this article. For more information ...

References

[1] Bidirectional Encoder Representations from Transformers

Appendix

You can implement your own variant of BERT by overriding the method forwardInternal.

override protected def forwardInternal(
  parameterStore: ParameterStore,
  inputNDList: NDList,
  training: Boolean,
  params: PairList[String, java.lang.Object]): NDList = {

  // Dimension batch_size x max_sentence_size
  val tokenIds = inputNDList.get(0)
  val typeIds = inputNDList.get(1)
  val inputMasks = inputNDList.get(2)

  // Dimension batch_size x num_masked_token
  val maskedIndices = inputNDList.get(3)

  // Sub-manager scoping the intermediate tensors of this forward pass
  val ndChildManager = NDManager.subManagerOf(tokenIds)
  try {
    ndChildManager.tempAttachAll(inputNDList)

    // Step 1: Process the transformer block for BERT
    val bertBlockNDInput = new NDList(tokenIds, typeIds, inputMasks)
    val ndBertResult = bertBlock.forward(parameterStore, bertBlockNDInput, training)

    // Step 2: Process the Next Sentence Predictor block
    // Embedding sequence dimensions are batch_size x max_sentence_size x embedding_size
    val embeddedSequence = ndBertResult.get(0)
    val pooledOutput = ndBertResult.get(1)

    // Need to un-squeeze for batch size = 1, (embedding_vector) => (1, embedding_vector)
    val unSqueezePooledOutput =
      if(pooledOutput.getShape.dimension() == 1) {
        val expanded = pooledOutput.expandDims(0)
        ndChildManager.tempAttachAll(expanded)
        expanded
      }
      else
        pooledOutput

    // We compute the NSP probabilities in case there is more than one sentence
    val logNSPProbabilities: NDArray =
      bertNSPBlock.forward(parameterStore, new NDList(unSqueezePooledOutput), training)
        .singletonOrThrow

    // Step 3: Process the Masked Language Model block
    // Embedding table dimensions are vocabulary_size x embedding_size
    val embeddingTable = bertBlock
      .getTokenEmbedding
      .getValue(parameterStore, embeddedSequence.getDevice, training)

    // Dimension: (batch_size x maskSize) x vocabulary_size
    val logMLMProbabilities: NDArray = bertMLMBlock
      .forward(
        parameterStore,
        new NDList(embeddedSequence, maskedIndices, embeddingTable),
        training)
      .singletonOrThrow

    // Finally build the output and hand it over to the parent manager
    val ndOutput = new NDList(logNSPProbabilities, logMLMProbabilities)
    ndChildManager.ret(ndOutput)
  }
  catch { ... }
  finally {
    // Release the intermediate tensors held by the sub-manager
    ndChildManager.close()
  }
}
  



---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3

Sunday, August 20, 2023

Automate Medical Coding Using BERT

Target audience: Beginner
Estimated reading time: 5'
Transformers and self-attention models are increasingly taking center stage in the NLP toolkit of data scientists [ref 1]. This article delves into the design, deployment, and assessment of a specialized transformer tasked with extracting medical codes from Electronic Health Records (EHR) [ref 2]. The focus is on curbing development and training expenses while ensuring the model remains current.


Table of contents
Introduction
     Extracting medical codes
     Minimizing costs
     Keeping models up-to-date
Architecture
Tokenizer
BERT encoder
     Context embedding
     Segmentation
     Transformer
     Self-attention
Classifier
Active learning
References


Important notes
  • This piece doesn't serve as a primer or detailed account of transformer-based encoders, Bidirectional Encoder Representations from Transformers (BERT), multi-label classification or active learning. Detailed and technical information on these models is available in the References section [ref 1, 3, 8, 12].
  • The terms medical document, medical note and clinical note are used interchangeably.
  • Some functionalities discussed here are protected intellectual property, hence the omission of source code.


Introduction

Autonomous medical coding refers to the use of artificial intelligence (AI) and machine learning (ML) technologies to automatically assign medical codes to patient records [ref 4]. Medical coding is the process of assigning standardized codes to diagnoses, medical procedures, and services provided during a patient's visit to a healthcare facility. These codes are used for billing, reimbursement, and research purposes.


By automating the medical coding process, healthcare organizations can improve efficiency, accuracy, and consistency, while also reducing costs associated with manual coding.

 

A health insurance claim is an indication of the service given by a provider, even though the medical records associated with this service can greatly vary in content and structure. It's crucial to precisely extract medical codes from clinical notes since outcomes, like hospitalizations, treatments, or procedures, are directly tied to these diagnostic codes. Even if there are minor variations in the codes, claims can still be valid for specific services, provided the clinical notes, patient history, diagnosis, and advised procedures align.


fig. 1 Extraction of knowledge, predictions from electronic medical records 

Medical coding is the transformation of healthcare diagnoses, procedures, and medical services described in electronic health records, physician's notes or laboratory results into alphanumeric codes. This study focuses on the automated generation of medical codes and health insurance claims from a given clinical note or electronic health record.

Challenges

There are 3 issues to address:
  1. How to extract medical codes reliably, given that labeling of medical codes is error prone and the clinical documents are very inconsistent?
  2. How to minimize the cost of self-training complex deep models such as transformers while preserving an acceptable accuracy?
  3. How to continuously keep models up to date in a production environment?

Extracting medical codes

Medical codes are derived from patient records and clinical notes to forecast procedural results, determine the length of hospital stays, or generate insurance claims. The most prevalent medical coding systems include:
  • International Classification of Diseases (ICD-10) for diagnosis (with roughly 72,000 codes)
  • Current Procedural Terminology (CPT) for procedures and medications (encompassing around 19,000 codes)
  • Along with others like Modifiers, SNOMED, and so forth.
The vast array of medical codes poses significant challenges in extraction due to:
  • The seemingly endless combinations of codes linked to a specific medical document
  • Varied and inconsistent formats of patient records (in terms of terminology, structure, and length).
  • Complications in gleaning context from medical information systems.

Minimizing costs

A study on deep learning models suggests that training a large language model (LLM) results in the emission of 626,155 pounds of CO2, comparable to the total emissions from five vehicles over their lifespan.

To illustrate, GPT-3/ChatGPT underwent training on 500 billion words with a model size of 175 billion parameters. A single training session would require 355 GPU-years and bear a cost of no less than $4.6M. Efforts are currently being made to fine-tune resource utilization for the development of upcoming models [ref 5].

Keeping models up-to-date

Customer data in real-time is continuously changing, often deviating from the distribution patterns the models were originally trained on (due to concept and covariate shifts).
This challenge is particularly pronounced for transformers that need task-specific fine-tuning and might even necessitate restarting the pre-training process — both of which are resource-intensive actions.

Architecture

To tackle the challenges highlighted earlier, the proposed solution should encompass four essential AI/NLP elements:
  • Tokenizer to extract tokens, segments & vocabulary from a corpus of medical documents.
  • Bidirectional Encoder Representations from Transformers (BERT) to generate a representation (embedding) of the documents [ref 3].
  • Neural-based classifier to predict a set of diagnostic codes or insurance claim given the embeddings.
  • Active/transfer learning framework to update model through optimized selection/sampling of training data from production environment.
From a software engineering perspective, the system architecture should provide a modular integration capability with current IT infrastructures. It also requires an asynchronous messaging system with streaming capabilities, such as Kafka, and REST API endpoints to facilitate testing and seamless production deployment.

fig. 2  Architecture for integration of AI components with external medical IT systems 


Tokenizer 

The effectiveness of a transformer encoder's output hinges on the quality of its input: tokens and segments or sentences derived from clinical documents. Several pressing questions need addressing:

  1. Which vocabulary is most suitable for token extraction from these notes? Do we consider domain-specific terms, abbreviations, TF-IDF scores, etc.?
  2. What's the best approach to segmenting a note into coherent units, such as sections or sentences?
  3. How do we incorporate or embed pertinent contextual data about the patient or provider into the encoder?
Tokens play a pivotal role in formulating a dynamic vocabulary. This vocabulary can be enriched by incorporating words or N-grams from various sources like:
  • Terminology from the American Medical Association (AMA)
  • Common medical terms with high TF-IDF scores
  • Different senses of words
  • Abbreviations
  • Semantic descriptions
  • Stems
  • .....

fig. 3 Generation of a vocabulary using training corpus and knowledge base

Our optimal approach is based on utilizing uncased words from the American Medical Association, coupled with the top 85% of terms derived from training medical notes, ranked by their highest TF-IDF scores. It's worth noting that this method can be resource-intensive.
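
As an illustration of this vocabulary strategy, here is a minimal sketch that merges a reference terminology with the top fraction of corpus terms ranked by their highest TF-IDF score. Everything in it is an assumption made for illustration: the tokenizer, the TF-IDF variant (term frequency times the log of the inverse document frequency), the tie-breaking rule and the names maxTfIdf, buildVocabulary and keepFraction. The actual implementation is proprietary and may differ.

```scala
object VocabularySketch {
  // Lower-cases the note (uncased vocabulary) and splits on non-alphanumeric characters
  def tokenize(note: String): Seq[String] =
    note.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  // Scores each term by its highest TF-IDF value across the corpus
  def maxTfIdf(notes: Seq[String]): Map[String, Double] = {
    val docs = notes.map(tokenize).filter(_.nonEmpty)
    val numDocs = docs.size.toDouble
    // Document frequency: number of notes containing the term
    val docFreq: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (term, occ) => term -> occ.size }
    val scored = for {
      doc <- docs
      (term, occurrences) <- doc.groupBy(identity).toSeq
    } yield {
      val tf = occurrences.size.toDouble / doc.size
      val idf = math.log(numDocs / docFreq(term))
      term -> tf * idf
    }
    scored.groupBy(_._1).map { case (term, scores) =>
      term -> scores.map(_._2).reduce((a, b) => math.max(a, b))
    }
  }

  // Merges the reference terminology with the top fraction of corpus terms
  def buildVocabulary(
    notes: Seq[String],
    referenceTerms: Set[String],
    keepFraction: Double = 0.85
  ): Set[String] = {
    // Highest score first, ties broken alphabetically for determinism
    val ranked = maxTfIdf(notes).toSeq
      .sortWith { case ((t1, s1), (t2, s2)) => s1 > s2 || (s1 == s2 && t1 < t2) }
    val numKept = math.ceil(ranked.size * keepFraction).toInt
    referenceTerms.map(_.toLowerCase) ++ ranked.take(numKept).map(_._1)
  }
}
```

Terms that appear in every note receive a zero score under this variant, so they are the first to be dropped when the fraction is lowered.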

BERT encoder

In NLP, words and documents are represented in the form of numeric vectors allowing similar words to have similar vector representations [ref 6].
The objective is to generate embeddings for medical documents, including contextual data, to be fed into a deep learning classifier to extract diagnostic codes or generate a medical insurance claim [ref 7].

Context embedding 

Contextual information such as patient data (age, gender,...), medical service provider, specialty, or location is categorized (or bucketed for continuous values) and added to the tokens extracted from the medical note.
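
A minimal sketch of this step, under stated assumptions: the Context fields, the 10-year age bins, the token naming scheme (age_40_49, gender_f, ...) and the choice to prepend the context tokens are all hypothetical and only illustrate how a continuous value can be turned into a categorical token.

```scala
object ContextTokenSketch {
  // Hypothetical patient/provider context: the field names are assumptions
  final case class Context(age: Int, gender: String, specialty: String)

  // Maps a continuous age to a categorical bucket token (10-year bins, capped at 90+)
  def ageBucket(age: Int): String = {
    require(age >= 0, "age must be non-negative")
    if (age >= 90) "age_90_plus"
    else {
      val lower = (age / 10) * 10
      s"age_${lower}_${lower + 9}"
    }
  }

  // Converts the context into tokens and prepends them to the note tokens
  def withContext(context: Context, noteTokens: Seq[String]): Seq[String] = {
    val contextTokens = Seq(
      ageBucket(context.age),
      s"gender_${context.gender.toLowerCase}",
      s"specialty_${context.specialty.toLowerCase}"
    )
    contextTokens ++ noteTokens
  }
}
```

The context tokens then need to be part of the vocabulary, like any other token fed to the encoder.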

Segmentation

Structuring electronic health records into logical or random groups of segments/sentences presents a significant challenge. Segmentation involves dividing a medical document into segments (or sections), each with an equal number of tokens that consist of sentences and relevant contextual data.

Several methods can be employed to segment a document:
  1. Isolating the contextual data as a standalone segment.
  2. Integrating the contextual data into the document's initial segment.
  3. Embedding the contextual data into any arbitrarily chosen segment [ref 6].

fig. 4 Embedding of medical note with contextual data using 2 segments


Our study shows that option 2 provides the best embedding for the feed-forward neural network classifier.
Interestingly, treating the entire note as a single sentence and using the AMA vocabulary leads to diminished accuracy in subsequent classification tasks.
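
Option 2 can be sketched as follows. This is a simplification under stated assumptions: segments are cut at fixed token counts and ignore sentence boundaries, the last segment is padded with a hypothetical [PAD] token so that all segments have equal length, and the names segment and segmentSize are illustrative.

```scala
object SegmentationSketch {
  // Hypothetical padding token used to equalize segment lengths
  val PadToken = "[PAD]"

  // Splits context + note tokens into segments of exactly segmentSize tokens.
  // The context tokens lead the first segment (option 2); the last segment is padded.
  def segment(
    contextTokens: Seq[String],
    noteTokens: Seq[String],
    segmentSize: Int
  ): Seq[Seq[String]] = {
    require(segmentSize > 0, "segmentSize must be positive")
    (contextTokens ++ noteTokens)
      .grouped(segmentSize)
      .map(_.padTo(segmentSize, PadToken))
      .toVector
  }
}
```

Option 1 would instead emit the context tokens as their own padded segment before segmenting the note.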

Transformer

We employ the self-supervised Bidirectional Encoder Representations from Transformers (BERT) with the objectives to:
  • Grasp the contextual significance of medical phrases.
  • Create embeddings/representations that merge clinical notes with contextual data.
The model construction involves two phases:
  1. Pretraining on an extensive, domain-specific corpus [ref 8].
  2. Fine-tuning tailored for specific tasks, like classification [ref 9].

After the pretraining phase concludes, the document embedding is introduced to the classifier training. This can be sourced:
  1. Directly from the output of the pretrained model (document embeddings).
  2. During the fine-tuning process of the pretrained model. Concurrently, fine-tuning operates alongside active learning for model updates.


fig. 5 Model weights update with features extraction vs fine tuning

It's strongly advised to utilize one of the pretrained BERT models like ClinicalBERT [ref 10] or GatorTron [ref 11], and then adapt the transformer for classification purposes. However, for this particular project, we initiated BERT's pretraining on a distinct set of clinical notes to gauge the influence of vocabulary and segmentation on prediction accuracy.


Self-attention

Here's a concise overview of the multi-head self-attention model for context:
The foundation of a transformer module is the self-attention block that processes token, position, and type embeddings prior to normalization. Multiple such modules are layered together to construct the encoder. A similar architecture is employed for the decoder.
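
For readers who prefer code to schematics, here is a minimal sketch of single-head scaled dot-product self-attention in plain Scala. It is a simplification under stated assumptions: queries, keys and values are all set to the input embeddings (no learned projection matrices), there is a single head, and masking, normalization and the feed-forward layer are omitted.

```scala
object SelfAttentionSketch {
  type Matrix = Vector[Vector[Double]]

  private def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Numerically stable softmax over one row of attention scores
  def softmax(scores: Vector[Double]): Vector[Double] = {
    val maxScore = scores.reduce((a, b) => math.max(a, b))
    val exps = scores.map(s => math.exp(s - maxScore))
    val total = exps.sum
    exps.map(_ / total)
  }

  // Scaled dot-product self-attention with queries = keys = values = embeddings
  def selfAttention(embeddings: Matrix): Matrix = {
    require(embeddings.nonEmpty, "at least one token embedding is required")
    val scale = math.sqrt(embeddings.head.size.toDouble)
    embeddings.map { query =>
      // Attention weights of this token over all tokens of the sequence
      val weights = softmax(embeddings.map(key => dot(query, key) / scale))
      // Weighted sum of the value vectors
      embeddings.zip(weights)
        .map { case (value, w) => value.map(_ * w) }
        .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    }
  }
}
```

Each output row is a convex combination of the input embeddings, which is how a token's representation absorbs context from the rest of the sequence.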


fig. 6 Schematic for transformer encoder block

Classifier

The classifier is structured as a straightforward feed-forward neural network (fully connected), since a more intricate design might not considerably enhance prediction accuracy. In addition to the standard hyper-parameter optimization, different network configurations were assessed.
The network's structure, including the number and dimensions of hidden layers, doesn't have a significant influence on the overall predictive performance.


Active learning

The goal is to modify models to tackle the issue of covariate shifts observed in the distribution of real-time/production data during inference.

The dual-faceted approach involves:
  1. Selecting data samples with labels that deviate from the distribution initially employed during training (Active learning) [ref 12].
  2. Adjusting the transformer for the classification objective using these samples (Transfer learning)
A significant obstacle in predicting diagnostic codes or medical claims is the steep labeling expense. In this context, learning algorithms can proactively seek labels from domain experts. This iterative form of supervised learning is known as active learning.
Because the learning algorithm selectively picks the examples, the quantity of samples needed to grasp a concept is frequently less than that required in traditional supervised learning. In this aspect, active learning parallels optimal experimental design, a standard approach in data analysis [ref 13].


fig. 7 Simplified data pipeline for active learning.

In our scenario, the active learning algorithm picks an unlabeled medical note, termed note-91, and sends it to a human coder who assigns it the diagnostic code S31.623A. Once a substantial number of notes are newly labeled, the model undergoes retraining. Subsequently, the updated model is rolled out and utilized to forecast diagnostic codes on notes in production.
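
The article does not specify how notes are selected for labeling. One common strategy, used here purely as an illustration, is uncertainty sampling: the notes on which the model is least confident are sent to the human coder first. The types ScoredNote, the function names and the labeling budget below are assumptions, not the production implementation.

```scala
object ActiveLearningSketch {
  // A production note with the model's predicted probability per diagnostic code
  final case class ScoredNote(noteId: String, codeProbabilities: Map[String, Double])

  // Confidence of the model on a note: probability of its most likely code
  def confidence(note: ScoredNote): Double =
    note.codeProbabilities.values.foldLeft(0.0)((a, b) => math.max(a, b))

  // Picks the `budget` notes the model is least confident about for human labeling
  def selectForLabeling(notes: Seq[ScoredNote], budget: Int): Seq[String] =
    notes
      .map(note => (note.noteId, confidence(note)))
      .sortWith { case ((id1, c1), (id2, c2)) => c1 < c2 || (c1 == c2 && id1 < id2) }
      .take(budget)
      .map(_._1)
}
```

With a budget of one, a note such as note-91 would be selected when its most likely code has a lower probability than the top code of every other note in the batch.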

Thank you for reading this article. For more information ...

References


A formal presentation of this project is available at


Glossary

  • Electronic health record (EHR): An electronic version of a patient's medical history that is maintained by the provider over time and may include all of the key administrative and clinical data relevant to that person's care under a particular provider, including demographics, progress notes, problems, medications, vital signs, past medical history, immunizations, laboratory data and radiology reports.
  • Medical document: Any medical artifact related to the health of a patient, such as clinical notes, X-rays or lab analysis results.
  • Clinical note: Medical document written by physicians following a visit. This is a textual description of the visit, focusing on vital signs, diagnosis, recommendations and follow-up.
  • ICD (International Classification of Diseases): Diagnostic codes that serve a broad range of uses globally and provide critical knowledge on the extent, causes and consequences of human disease and death worldwide via data that is reported and coded with the ICD. Clinical terms coded with ICD are the main basis for health recording and statistics on disease in primary, secondary and tertiary care, as well as on cause of death certificates.
  • CPT (Current Procedural Terminology): Codes that offer health care professionals a uniform language for coding medical services and procedures to streamline reporting and increase accuracy and efficiency. CPT codes are also used for administrative management purposes such as claims processing and developing guidelines for medical care review.

