
Sunday, August 20, 2023

Automate Medical Coding Using BERT

Target audience: Beginner
Estimated reading time: 5'


Transformers and self-attention models are increasingly taking center stage in the NLP toolkit of data scientists [ref 1]. This article delves into the design, deployment, and assessment of a specialized transformer tasked with extracting medical codes from Electronic Health Records (EHR) [ref 2]. The focus is on curbing development and training expenses while ensuring the model remains current.

Introduction

Challenges

Extracting medical codes

Minimizing costs

Keeping models up-to-date

Important notes:

This piece doesn't serve as a primer or detailed account of transformer-based encoders, Bidirectional Encoder Representations from Transformers (BERT), multi-label classification or active learning. Detailed and technical information on these models is available in the References section [ref 1, 3, 8, 12].
The terms medical document, medical note and clinical note are used interchangeably.
Some functionalities discussed here are protected intellectual property, hence the omission of source code.

Introduction

Autonomous medical coding refers to the use of artificial intelligence (AI) and machine learning (ML) technologies to automatically assign medical codes to patient records [ref 4]. Medical coding is the process of assigning standardized codes to diagnoses, medical procedures, and services provided during a patient's visit to a healthcare facility. These codes are used for billing, reimbursement, and research purposes.

By automating the medical coding process, healthcare organizations can improve efficiency, accuracy, and consistency, while also reducing costs associated with manual coding.

A health insurance claim is an indication of the service given by a provider, even though the medical records associated with this service can greatly vary in content and structure. It's crucial to precisely extract medical codes from clinical notes since outcomes, like hospitalizations, treatments, or procedures, are directly tied to these diagnostic codes. Even if there are minor variations in the codes, claims can still be valid for specific services, provided the clinical notes, patient history, diagnosis, and advised procedures align.

fig. 1 Extraction of knowledge, predictions from electronic medical records

Medical coding is the transformation of healthcare diagnoses, procedures, and medical services described in electronic health records, physician's notes or laboratory results into alphanumeric codes. This study focuses on automated generation of medical codes and health insurance claims from a given clinical note or electronic health record.

Challenges

There are 3 issues to address:

How to extract medical codes reliably, given that the labeling of medical codes is error-prone and clinical documents are highly inconsistent?
How to minimize the cost of self-training complex deep models such as transformers while preserving acceptable accuracy?
How to continuously keep models up to date in a production environment?

Extracting medical codes

Medical codes are derived from patient records and clinical notes to forecast procedural results, determine the length of hospital stays, or generate insurance claims. The most prevalent medical coding systems include:

International Classification of Diseases (ICD-10) for diagnosis (with roughly 72,000 codes)
Current Procedural Terminology (CPT) for procedures and medications (encompassing around 19,000 codes)
Along with others like Modifiers, SNOMED, and so forth.

The vast array of medical codes poses significant challenges in extraction due to:

The seemingly endless combinations of codes linked to a specific medical document
Varied and inconsistent formats of patient records (in terms of terminology, structure, and length).
Complications in gleaning context from medical information systems.

Minimizing costs

A study on deep learning models suggests that training a large language model (LLM) can emit 626,155 pounds of CO2, comparable to the total emissions from five vehicles over their lifespan.

To illustrate, GPT-3/ChatGPT underwent training on 500 billion words with a model size of 175 billion parameters. A single training session would require 355 GPU-years and bear a cost of no less than $4.6M. Efforts are currently being made to fine-tune resource utilization for the development of upcoming models [ref 5].

Keeping models up-to-date

Real-time customer data changes continuously, often deviating from the distribution the models were originally trained on (concept and covariate shifts).

This challenge is particularly pronounced for transformers that need task-specific fine-tuning and might even necessitate restarting the pre-training process — both of which are resource-intensive actions.

Architecture

To tackle the challenges highlighted earlier, the proposed solution should encompass four essential AI/NLP elements:

Tokenizer to extract tokens, segments & vocabulary from a corpus of medical documents.
Bidirectional Encoder Representations from Transformers (BERT) to generate a representation (embedding) of the documents [ref 3].
Neural-based classifier to predict a set of diagnostic codes or insurance claim given the embeddings.
Active/transfer learning framework to update model through optimized selection/sampling of training data from production environment.

From a software engineering perspective, the system architecture should provide a modular integration capability with current IT infrastructures. It also requires an asynchronous messaging system with streaming capabilities, such as Kafka, and REST API endpoints to facilitate testing and seamless production deployment.

fig. 2 Architecture for integration of AI components with external medical IT systems

Tokenizer

The effectiveness of a transformer encoder's output hinges on the quality of its input: tokens and segments or sentences derived from clinical documents. Several pressing questions need addressing:

Which vocabulary is most suitable for token extraction from these notes? Do we consider domain-specific terms, abbreviations, TF-IDF scores, etc.?
What's the best approach to segmenting a note into coherent units, such as sections or sentences?
How do we incorporate or embed pertinent contextual data about the patient or provider into the encoder?

Tokens play a pivotal role in formulating a dynamic vocabulary. This vocabulary can be enriched by incorporating words or N-grams from various sources like:

Terminology from the American Medical Association (AMA)
Common medical terms with high TF-IDF scores
Different senses of words
Abbreviations
Semantic descriptions
Stems
.....

fig. 3 Generation of a vocabulary using training corpus and knowledge base

Our optimal approach is based on utilizing uncased words from the American Medical Association, coupled with the top 85% of terms derived from training medical notes, ranked by their highest TF-IDF scores. It's worth noting that this method can be resource-intensive.
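
The vocabulary selection described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the notes are already tokenized, and the function names `tfidf_scores` and `build_vocabulary`, the smoothed IDF variant, and the rule of ranking each term by its highest score across notes are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_scores(notes):
    """Map each term to its highest TF-IDF score over all tokenized notes."""
    n_docs = len(notes)
    # Document frequency: number of notes containing each term.
    doc_freq = Counter(term for note in notes for term in set(note))
    best = {}
    for note in notes:
        for term, count in Counter(note).items():
            tf = count / len(note)
            # Smoothed IDF (an assumption; other variants are equally valid).
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            best[term] = max(best.get(term, 0.0), tf * idf)
    return best


def build_vocabulary(notes, reference_terms, keep_ratio=0.85):
    """Union of uncased reference terms and the top-ranked corpus terms."""
    lowered = [[token.lower() for token in note] for note in notes]
    scores = tfidf_scores(lowered)
    # Rank by descending score; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    n_keep = math.ceil(keep_ratio * len(ranked))
    return {term.lower() for term in reference_terms} | set(ranked[:n_keep])
```

Here `reference_terms` stands in for the AMA terminology, and `keep_ratio=0.85` mirrors the top 85% of terms mentioned above.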

BERT encoder

In NLP, words and documents are represented in the form of numeric vectors allowing similar words to have similar vector representations [ref 6].

The objective is to generate embeddings for medical documents, including contextual data, to be fed into a deep learning classifier that extracts diagnostic codes or generates a medical insurance claim [ref 7].

Context embedding

Contextual information such as patient data (age, gender, ...), medical service provider, specialty, or location is categorized (or bucketed for continuous values) and added to the tokens extracted from the medical note.
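
As a rough sketch of this step, the snippet below turns a continuous value (age) into a bucket label and categorical values into context tokens. The bucket boundaries, the token naming scheme and the function names are hypothetical and chosen only for illustration.

```python
import bisect

# Hypothetical bucket boundaries (in years) and the matching token labels.
AGE_BOUNDS = [18, 40, 65]
AGE_LABELS = ["age_0_17", "age_18_39", "age_40_64", "age_65_plus"]


def bucket_age(age):
    """Map a continuous age to the label of the bucket it falls into."""
    return AGE_LABELS[bisect.bisect_right(AGE_BOUNDS, age)]


def context_tokens(age, gender, specialty):
    """Build the context tokens to be added to the tokens of a note."""
    return [
        bucket_age(age),
        f"gender_{gender.lower()}",
        f"specialty_{specialty.lower()}",
    ]
```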

Segmentation

Structuring electronic health records into logical or random groups of segments/sentences presents a significant challenge. Segmentation involves dividing a medical document into segments (or sections), each with an equal number of tokens that consist of sentences and relevant contextual data.

Several methods can be employed to segment a document:

Isolating the contextual data as a standalone segment.
Integrating the contextual data into the document's initial segment.
Embedding the contextual data into any arbitrarily chosen segment [ref 6].

fig. 4 Embedding of medical note with contextual data using 2 segments

Our study shows that option 2 provides the best embedding for the feed-forward neural network classifier.
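
A simplified sketch of option 2 is shown below: the context tokens are placed at the start of the first segment and the note is cut into segments of equal length. Unlike the approach described above, this example splits on token counts only and ignores sentence boundaries; the function name `segment_note` and the `[PAD]` filler are assumptions.

```python
def segment_note(tokens, context, segment_size, pad="[PAD]"):
    """Split a note into equal-size segments, context first (option 2)."""
    if segment_size <= len(context):
        raise ValueError("segment_size must exceed the number of context tokens")
    sequence = list(context) + list(tokens)
    if not sequence:
        return []
    segments = [
        sequence[start:start + segment_size]
        for start in range(0, len(sequence), segment_size)
    ]
    # Pad the last segment so that every segment has the same length.
    segments[-1] = segments[-1] + [pad] * (segment_size - len(segments[-1]))
    return segments
```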

Interestingly, treating the entire note as a single sentence and using the AMA vocabulary leads to diminished accuracy in subsequent classification tasks.

Transformer

We employ the self-supervised Bidirectional Encoder Representations from Transformers (BERT) model with the following objectives:

Grasp the contextual significance of medical phrases.
Create embeddings/representations that merge clinical notes with contextual data.

The model construction involves two phases:

Pretraining on an extensive, domain-specific corpus [ref 8].
Fine-tuning tailored for specific tasks, like classification [ref 9].

After the pretraining phase concludes, the document embedding is introduced to the classifier training. This can be sourced:

Directly from the output of the pretrained model (document embeddings).
During the fine-tuning process of the pretrained model. Fine-tuning then operates alongside active learning for model updates.

fig. 5 Model weights update with features extraction vs fine tuning

It's strongly advised to utilize one of the pretrained BERT models like ClinicalBERT [ref 10] or GatorTron [ref 11], and then adapt the transformer for classification purposes. However, for this particular project, we initiated BERT's pretraining on a distinct set of clinical notes to gauge the influence of vocabulary and segmentation on prediction accuracy.

Self-attention

Here's a concise overview of the multi-head self-attention model for context:
The foundation of a transformer module is the self-attention block that processes token, position, and type embeddings prior to normalization. Multiple such modules are layered together to construct the encoder. A similar architecture is employed for the decoder.

fig. 6 Schematic for transformer encoder block
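
For readers who prefer code to schematics, here is a minimal single-head sketch of the scaled dot-product attention at the core of the self-attention block, softmax(Q K^T / sqrt(d_k)) V. It deliberately omits the learned projections, the multiple heads, the residual connections and the normalization of a real encoder; the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, on lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```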

Classifier

The classifier is structured as a straightforward feed-forward neural network (fully connected), since a more intricate design might not considerably enhance prediction accuracy. In addition to the standard hyper-parameter optimization, different network configurations were assessed.
The network's structure, including the number and dimensions of hidden layers, doesn't have a significant influence on the overall predictive performance.

Active learning

The goal is to modify models to tackle the issue of covariate shifts observed in the distribution of real-time/production data during inference.

The dual-faceted approach involves:

Selecting data samples with labels that deviate from the distribution initially employed during training (Active learning) [ref 12].
Adjusting the transformer for the classification objective using these samples (Transfer learning)

A significant obstacle in predicting diagnostic codes or medical claims is the steep labeling expense. In this context, learning algorithms can proactively seek labels from domain experts. This iterative form of supervised learning is known as active learning.

Because the learning algorithm selectively picks the examples, the quantity of samples needed to grasp a concept is frequently less than that required in traditional supervised learning. In this aspect, active learning parallels optimal experimental design, a standard approach in data analysis [ref 13].

fig. 7 Simplified data pipeline for active learning.

In our scenario, the active learning algorithm picks an unlabeled medical note, termed note-91, and sends it to a human coder who assigns it the diagnostic code S31.623A. Once a substantial number of notes are newly labeled, the model undergoes retraining. Subsequently, the updated model is rolled out and utilized to forecast diagnostic codes on notes in production.
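
The loop described in this scenario can be sketched as below. This is an illustration under stated assumptions rather than the production pipeline: notes are ranked by least confidence (lowest top predicted probability), `ask_coder` is a hypothetical callback standing in for the human coder, and `retrain_threshold` is an arbitrary number of new labels that triggers retraining.

```python
def least_confident(predictions, batch_size):
    """Return the ids of the notes whose top predicted probability is lowest.

    predictions maps a note id to a {code: probability} dictionary.
    """
    confidence = {
        note_id: max(probs.values()) for note_id, probs in predictions.items()
    }
    ranked = sorted(confidence, key=lambda note_id: (confidence[note_id], note_id))
    return ranked[:batch_size]


def active_learning_round(predictions, ask_coder, labeled, batch_size, retrain_threshold):
    """Send the selected notes to the coder; report whether retraining is due."""
    for note_id in least_confident(predictions, batch_size):
        labeled[note_id] = ask_coder(note_id)
    return len(labeled) >= retrain_threshold
```

With three candidate notes, the note whose prediction is closest to a coin flip (note-91 in the scenario above) is the one sent to the coder.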

Thank you for reading this article. For more information ...

References

[1] Attention is all you need

[2] What is medical coding?

[3] Towards data science: BERT Explained: State of the art language model for NLP

[4] Automated clinical coding: what, why, and where we are?

[5] Mitigating the Water Consumption and Carbon Emissions of ChatGPT

[6] Towards data science: What Is Embedding and What Can You Do with It

[7] A Guide on Word Embeddings in NLP

[8] Hugging face: State Of The Art NLP Model Explained
[9] What Is Fine-Tuning and How Does It Work in Neural Networks?

[10] ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission
[11] A Large Language Model for Electronic Health Records
[12] Towards data science: Active Learning in Machine Learning

[13] Active learning strategies

A formal presentation of this project is available at

Autonomous Medical Coding with Discriminative Transformers

Glossary

Electronic health record (EHR): An electronic version of a patient's medical history that is maintained by the provider over time and may include all of the key administrative clinical data relevant to that person's care under a particular provider, including demographics, progress notes, problems, medications, vital signs, past medical history, immunizations, laboratory data and radiology reports.
Medical document: Any medical artifact related to the health of a patient, such as clinical notes, X-rays, or lab analysis results.
Clinical note: Medical document written by physicians following a visit. This is a textual description of the visit, focusing on vital signs, diagnosis, recommendations and follow-up.
ICD (International Classification of Diseases): Diagnostic codes that serve a broad range of uses globally and provide critical knowledge on the extent, causes and consequences of human disease and death worldwide via data that is reported and coded with the ICD. Clinical terms coded with ICD are the main basis for health recording and statistics on disease in primary, secondary and tertiary care, as well as on cause of death certificates.
CPT (Current Procedural Terminology): Codes that offer health care professionals a uniform language for coding medical services and procedures to streamline reporting, increase accuracy and efficiency. CPT codes are also used for administrative management purposes such as claims processing and developing guidelines for medical care review.

---------------------------

Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning.
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3

Tuesday, May 10, 2022

Explore Data Centric A.I.

Target audience: Intermediate

Estimated reading time: 4'


Data-Centric AI (DCAI) is a burgeoning domain focused on methods and models that prioritize the selection, monitoring, and enhancement of datasets for the training, validation, and testing of established machine learning models [ref 1]. Historically, the data science community has placed undue emphasis on refining models, often overlooking the paramount importance of data quality during both training and inference phases. As the adage goes, "garbage in, garbage out" [ref 2].

Challenges

Data distribution shift

Detecting data inconsistency

Active learning for data distribution shift

Relabeling

Consensus labeling

Dawid-Skene aggregator

Confident learning

Encoding human priors

Class rebalancing

Data augmentation

Under-sampling and over-sampling

Class weighting

Ensemble methods

Cost-sensitive learning

One-vs-all strategy

Transfer learning

What you will learn: How to select, analyze, re-balance training data sets and create or correct labels for building robust ML models.

The management of quality of input data and labels has been regarded as an afterthought driven more by intuition than a strict engineering process. However, researchers have recently introduced techniques that turn improvement of data into an engineering discipline [ref 3].

This post lists some of the important challenges in understanding and managing quality of feedback/annotations and input data once a trained and validated model is deployed in production.

Challenges

In this post we assume that the nature or distribution of data used in training or evaluating a machine learning model may change over time. It is not uncommon for the quality of a trained and validated model to degrade during inference because the distribution of the production data shifts over time.

Let's consider a simple data pipeline for a multi-class classification model.

Illustration of basic labeled data flow

Data scientists and engineers go through the process of training, fine-tuning, and validating a model using data that has undergone selection, sampling, cleaning, preprocessing, and annotation, accomplished through various means including crowd sourcing and the expertise of domain specialists.

When the model meets the established quality metrics and fits within resource limitations, it is then deployed into a production environment, where it encounters numerous challenges:

How does the model fare against real-world production data and feedback?
Has the distribution of input data in production shifted over time?
How does feedback, if available, differ from the original annotations?
Has the data preparation and cleansing accounted for every category of outliers?
Were the assumptions regarding end user skills and basic knowledge correct?
Was the training data set annotated with a formal consensus and free of bias?
Are the human priors, if any were injected into the model, still valid in production?
Does the current data pipeline comply with regulatory and privacy requirements?

Let's review some of the key issues that may arise once users exercise the model.

Data distribution shift

Detecting data inconsistency

The different types of data shift can be derived from Bayes' theorem [ref 4].

\[p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}\]

Let's consider production or test data x with label y, and a trained discriminative model that predicts the class y through p(y|x).

Covariate shift occurs when input, p(x) changes between training and test/production but the mapping input - label, p(y|x) does not.
Concept shift occurs when p(y|x) changes between training and production, but input p(x) distribution does not.
Prior probability shift (or label shift) occurs when p(y) changes between training and production, but p(x|y) does not [ref 4].
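
Stated side by side, the three cases read as follows. The subscripts "train" and "prod" are a notation introduced here for illustration; they denote the distributions observed during training and in production, respectively.

\[
\begin{aligned}
\text{Covariate shift:} \quad & p_{\text{train}}(x) \neq p_{\text{prod}}(x), \quad p_{\text{train}}(y|x) = p_{\text{prod}}(y|x) \\
\text{Concept shift:} \quad & p_{\text{train}}(y|x) \neq p_{\text{prod}}(y|x), \quad p_{\text{train}}(x) = p_{\text{prod}}(x) \\
\text{Prior probability shift:} \quad & p_{\text{train}}(y) \neq p_{\text{prod}}(y), \quad p_{\text{train}}(x|y) = p_{\text{prod}}(x|y)
\end{aligned}
\]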

If the data distribution in the test or production environments differs from that in the training and validation sets, follow these steps:

Regularly create test sets by sampling from production data.
Incorporate these new test sets into the original training dataset, and then partition it into a fresh training-test set.
Calculate the error metrics (such as Accuracy, Precision, F1 score, etc.) on the training sets, the combined train-test sets, and the standalone test sets.

Metric	Case 1	Case 2	Case 3	Case 4
Train error	1%	1%	10%	10%
Train-test error	9%	2%	11%	11%
Test error	10%	10%	12%	20%
Diagnostic	Overfitting (high variance)	Distribution shift	Under-fitting (high bias)	High bias and distribution shift

In the second and fourth cases, the notable difference in error rates between the train-test and the test data clearly indicates a distribution shift or mismatch between the data used for training the model and the data encountered in production. Active learning is a widely used method to tackle this issue.
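
The reading of the three error rates can be turned into a small rule-based check. The sketch below is only one possible heuristic: the `target_error` and `gap` thresholds are arbitrary assumptions that would have to be tuned for a real model, and the function name `diagnose` is illustrative.

```python
def diagnose(train_error, train_test_error, test_error, target_error=0.02, gap=0.05):
    """Flag bias, variance and distribution shift from three error rates."""
    findings = []
    # Training error far above the acceptable level: the model under-fits.
    if train_error - target_error >= gap:
        findings.append("high bias (under-fitting)")
    # Large gap between training and train-test error: the model overfits.
    if train_test_error - train_error >= gap:
        findings.append("high variance (overfitting)")
    # Large gap between train-test and production test error: data has shifted.
    if test_error - train_test_error >= gap:
        findings.append("distribution shift")
    return findings or ["no issue detected"]
```

For example, `diagnose(0.01, 0.02, 0.10)` flags only a distribution shift, since the model fits both the training and the train-test sets well but not the production sample.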

Active learning for data distribution shift

Inadequate model predictions could stem from the use of training labels that aren't applicable to the current data distribution. For example, a self-driving vehicle model trained predominantly in urban settings would require newly labeled data from rural and less dense environments.

The expense of acquiring new labels and retraining the existing model can be substantial. Active learning, also known as optimal experimental design in the field of statistics, is a semi-supervised approach that reduces the amount of labeled data needed to train a model [ref 5].

There are 3 architectures to generate/update labels:

Membership query synthesis: The learner generates (synthesizes) the samples to be annotated from the shifted distribution. This scenario applies to small data sets.
Stream-based selective sampling: Each real-time data point is analyzed as a candidate for annotation.
Pool-based sampling: The most common technique. It draws data points from a pool of unlabeled data, estimates the model's confidence for each, then sends the point with the highest information gain (or least confidence) for annotation.

The objective of any sampling method is to identify unlabeled data that are:

Near a decision boundary (Uncertainty or exploitation sampling)
Unknown or rarely seen by the current model (diversity or exploration sampling)

There are several algorithms for selecting candidate data points for annotation from a pool of unlabeled data. Among them:

Variance reduction: Prioritize data points with the highest variance to converge toward a mean.
Contextual multi-armed bandit: Pick the data points that maximize the expected reward (UCB-1).
Random sampling: Select data points independently of their entropy.
Entropy maximization: Prioritize data points with the highest uncertainty.
Margin selection: Prioritize data points with the smallest margin between the two classes/labels with the highest probability (support vector machine).
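
Two of these strategies, entropy maximization and margin selection, are simple enough to sketch for pool-based sampling. The example assumes the current model already produced class probabilities for every sample in the pool; the function names and the pool layout are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy of a class probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def margin(probs):
    """Difference between the two highest class probabilities."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def select_for_annotation(pool, k, strategy="entropy"):
    """Pick k sample ids from pool, a {sample_id: class probabilities} dict."""
    if strategy == "entropy":
        # Highest entropy first.
        ranked = sorted(pool, key=lambda s: -entropy(pool[s]))
    elif strategy == "margin":
        # Smallest margin first.
        ranked = sorted(pool, key=lambda s: margin(pool[s]))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:k]
```

The two criteria do not always agree: a sample split evenly between two classes has the smallest possible margin, while a sample spread over three classes can have a higher entropy.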

Relabeling

Techniques to address incorrect labeling or annotating of data abound. Preventing, finding and correcting errors in labeling a large dataset is not a trivial task even for experts.
Here are some of the most interesting approaches to either produce valid labels or correct wrong labels.

Consensus labeling

Consensus labeling is the simplest technique to validate labels from a group of annotators or domain experts. In the case of multiple annotators, we need to estimate the following:

Consensus label for each example: Collect the labels with consensus among annotators, whenever possible.
Quality score for each consensus label: Estimate the percentage of annotators who selected a given label for a given input.
Quality score for each annotator: % of labels selected by a given annotator which agree with consensus.
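
These three estimates can be computed with a plain majority vote, as in the sketch below. The data layout ({example: {annotator: label}}), the function names, and the alphabetical tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def consensus(annotations):
    """Majority label and agreement ratio for each example.

    annotations maps an example id to a {annotator: label} dictionary.
    """
    result = {}
    for example, votes in annotations.items():
        counts = Counter(votes.values())
        # Most frequent label; ties are broken alphabetically.
        label, count = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))[0]
        result[example] = (label, count / len(votes))
    return result


def annotator_quality(annotations, consensus_labels):
    """Fraction of each annotator's labels that agree with the consensus."""
    agree, total = Counter(), Counter()
    for example, votes in annotations.items():
        for annotator, label in votes.items():
            total[annotator] += 1
            agree[annotator] += int(label == consensus_labels[example][0])
    return {annotator: agree[annotator] / total[annotator] for annotator in total}
```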

Dawid-Skene aggregator

The Dawid-Skene aggregation model is a statistical model that uses confusion matrices to characterize the skill level of annotators. It employs an expectation-maximization (EM) algorithm to determine the probability of errors made by annotators, based on their annotations and the probabilities of the true (correct) labels [ref 6, 7]. Utilizing this method necessitates a solid understanding of statistical inference.
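
To make the EM procedure more concrete, here is a compact, unoptimized sketch in plain Python. It assumes every label belongs to `classes`, initializes the posteriors with vote fractions, and adds a small `smoothing` constant to avoid zero probabilities; these choices, like the function name, are assumptions and not part of the original model description.

```python
def dawid_skene(annotations, classes, iterations=20, smoothing=0.01):
    """Estimate true-label posteriors and per-annotator confusion matrices.

    annotations maps an example id to a {annotator: label} dictionary.
    """
    annotators = sorted({a for votes in annotations.values() for a in votes})
    # Initialization: posterior of each class = fraction of votes it received.
    posteriors = {
        ex: {c: sum(label == c for label in votes.values()) / len(votes) for c in classes}
        for ex, votes in annotations.items()
    }
    confusion = {}
    for _ in range(iterations):
        # M-step: class priors and confusion[annotator][true][observed].
        prior = {c: sum(p[c] for p in posteriors.values()) / len(posteriors) for c in classes}
        confusion = {
            a: {t: {o: smoothing for o in classes} for t in classes} for a in annotators
        }
        for ex, votes in annotations.items():
            for a, observed in votes.items():
                for t in classes:
                    confusion[a][t][observed] += posteriors[ex][t]
        for a in annotators:
            for t in classes:
                row_total = sum(confusion[a][t].values())
                for o in classes:
                    confusion[a][t][o] /= row_total
        # E-step: posterior over the true label of each example.
        for ex, votes in annotations.items():
            scores = {}
            for t in classes:
                score = prior[t]
                for a, observed in votes.items():
                    score *= confusion[a][t][observed]
                scores[t] = score
            total = sum(scores.values())
            posteriors[ex] = {t: s / total for t, s in scores.items()}
    return posteriors, confusion
```

The diagonal of each confusion matrix gives a per-class estimate of an annotator's reliability, which is what distinguishes this model from a simple majority vote.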

Confident learning

Confident learning is a data-centric approach that focuses on label quality by identifying label errors in datasets. This technique enables data scientists to

Characterize label noise.
Find label errors.
Learn with noisy labels.

The objective is to estimate the joint distribution between the noisy, observed labels and the true latent labels [ref 8]. The method estimates the ratio of wrong positive and negative labels. The noise in actual labels is estimated by learning from uncorrupted labels in ideal conditions.
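
A simplified sketch of the counting step behind this estimate is given below. It assumes out-of-sample predicted probabilities are available for every example, uses the mean self-confidence of each class as its threshold, and skips the calibration and pruning steps of the full method; the function name `find_label_issues` is used here only for illustration.

```python
def find_label_issues(labels, pred_probs):
    """Count (given label, likely true label) pairs and flag disagreements.

    labels: observed class indices; pred_probs: per-example class probabilities.
    """
    n_classes = len(pred_probs[0])
    # Per-class threshold: mean probability of class j over examples labeled j.
    thresholds = []
    for j in range(n_classes):
        own = [probs[j] for given, probs in zip(labels, pred_probs) if given == j]
        thresholds.append(sum(own) / len(own) if own else 1.0)
    confident_joint = [[0] * n_classes for _ in range(n_classes)]
    issues = []
    for index, (given, probs) in enumerate(zip(labels, pred_probs)):
        confident = [j for j in range(n_classes) if probs[j] >= thresholds[j]]
        if not confident:
            continue  # The model is not confident about any class.
        likely = max(confident, key=lambda j: probs[j])
        confident_joint[given][likely] += 1
        if likely != given:
            issues.append(index)
    return confident_joint, issues
```

The off-diagonal cells of `confident_joint` count the examples whose observed label likely differs from the true label, and `issues` lists their indices as candidates for relabeling.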

Encoding human priors

Incorporating human priors, either domain knowledge or data scientist assumptions, into a model goes well beyond the Bayes prior (class probability) p(C). It may encompass the selection of a neural architecture (convolution, recurrence, ...), domain-related facts, and first-order logic.

The most commonly applied method to inject domain knowledge into a model is knowledge distillation (teacher-student networks in deep learning).

Class rebalancing

Training classification models often encounters a noted challenge: class imbalance. This happens when there's a disproportionate distribution of classes in the dataset, potentially causing bias in the model's outcome. This phenomenon is particularly evident in binary classification. For instance, a diagnostic model analyzing doctor visit summaries could lean heavily towards detecting prevalent diseases.

Interestingly, many data professionals might overlook that class imbalances can also emerge in a live production setting.

Let's explore methods to tackle this imbalance issue.

Data augmentation

This technique involves creating artificial data based on your existing data set. This could be as simple as creating copies with small alterations. For text data, common methods include synonym replacement, random insertion, random deletion, and sentence shuffling.

Feature vs data space augmentation:

Feature space augmentation modifies data in the embedding or representation space.
Data space augmentation modifies raw text data.

Example of augmentation using BERT:
The objective is to duplicate a record (sequence of tokens) and randomly replace a token with ‘[UNK]’. The original and duplicate (or augmented) records, associated with the same label, are then used to fine-tune (or optionally pre-train) the model with the same loss function and masked language model (MLM).
Here are some examples of easy augmentation steps:

Replace contractions with their expanded form and vice versa (e.g., don’t with do not)
Replace with synonyms.
Remove out of vocabulary terms from original note.
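
The '[UNK]' replacement described above can be sketched in a few lines. The example assumes records are (tokens, label) pairs and uses a seeded random generator for reproducibility; the function names are illustrative.

```python
import random


def mask_augment(tokens, rng, mask="[UNK]"):
    """Return a copy of tokens with one randomly chosen token masked."""
    if not tokens:
        return []
    augmented = list(tokens)
    augmented[rng.randrange(len(augmented))] = mask
    return augmented


def augment_dataset(records, seed=42):
    """Emit each (tokens, label) record followed by its masked duplicate."""
    rng = random.Random(seed)
    augmented = []
    for tokens, label in records:
        augmented.append((tokens, label))
        augmented.append((mask_augment(tokens, rng), label))
    return augmented
```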

Under-sampling and Over-sampling

Under-sampling involves randomly discarding examples from the majority class, while oversampling involves duplicating examples from the minority class. However, under-sampling may lead to information loss, while oversampling can lead to overfitting. A more sophisticated oversampling method is the Synthetic Minority Over-sampling Technique (SMOTE), but it's mostly applied on numerical data and might not be directly applicable to text data [ref 9].
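
Random under-sampling and over-sampling can both be sketched with the standard library. This example assumes records are (features, label) pairs and a non-empty data set; SMOTE is not covered, and the function name `resample` is illustrative.

```python
import random
from collections import defaultdict


def resample(records, mode="over", seed=0):
    """Balance classes by duplicating ("over") or discarding ("under") records."""
    if mode not in ("over", "under"):
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record in records:
        by_class[record[1]].append(record)
    sizes = [len(group) for group in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    balanced = []
    for group in by_class.values():
        if len(group) >= target:
            # Under-sampling: keep a random subset without replacement.
            balanced.extend(rng.sample(group, target))
        else:
            # Over-sampling: add random duplicates drawn with replacement.
            balanced.extend(group + rng.choices(group, k=target - len(group)))
    return balanced
```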

Class weighting

Class weighting in supervised learning is a technique used to address imbalances in the distribution of different classes within a training dataset. This imbalance can lead to biased models that perform well on the majority class but poorly on the minority class.

Here's a deeper dive into the concept:

Imbalance Issue: In many real-world datasets, some classes are underrepresented compared to others (e.g., targeted advertising, medical diagnostics).
Impact on Model: Machine learning models trained on such imbalanced datasets tend to be biased towards the majority class, often at the cost of misclassifying the minority class.

The primary goal of class weighting is to make the model pay more attention to the minority class during the training process by assigning a higher weight to the minority class and a lower weight to the majority class. These weights are used during the calculation of the loss function in the training process.
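
As an illustration, the sketch below derives weights that are inversely proportional to class frequencies (the same heuristic as the "balanced" option in scikit-learn) and applies them to a cross-entropy loss. The function names and the {class: probability} layout of the predictions are assumptions.

```python
import math
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}


def weighted_cross_entropy(labels, pred_probs, weights):
    """Weighted mean of -log p(true class), one {class: probability} per label."""
    total = sum(
        weights[label] * -math.log(probs[label])
        for label, probs in zip(labels, pred_probs)
    )
    return total / sum(weights[label] for label in labels)
```

With nine negative examples and one positive example, the positive class receives a weight of 5.0 against roughly 0.56 for the negative class, so an error on the rare class contributes far more to the loss.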

One noticeable drawback is the increase in complexity in model tuning.

Ensemble methods

This approach involves training multiple models and having them vote on the output. This can often lead to improved results, as different models might make different errors, and the ensemble can often make better decisions.

Ensemble learning is a powerful technique in machine learning that involves combining multiple models to improve the overall performance of a predictive task especially in the case of imbalanced class. This approach is based on the principle that a group of weak learners, when properly combined, can outperform a single strong learner.

These methods use multiple, diverse weak learners to increase the accuracy of prediction.

The most common techniques in ensemble learning are:

Bagging: This involves training multiple models on different subsets of the training data, sampled with replacement (bootstrap samples).
Boosting: This method trains learners sequentially, with each new model focusing on the errors made by the previous ones in an attempt to correct them. The final prediction is a weighted sum of the predictions made by each model.
Stacking: This involves training multiple models on the same data and then training a new model to aggregate their predictions.
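
Bagging with a majority vote is the easiest of the three to sketch. In the example below, `train` is a hypothetical callable that takes a list of records and returns a model, itself a callable that maps an input to a predicted label; ties in the vote are broken alphabetically.

```python
import random
from collections import Counter


def bagging(train, records, n_models, seed=0):
    """Train n_models models, each on a bootstrap sample of the records."""
    rng = random.Random(seed)
    return [
        train(rng.choices(records, k=len(records)))  # Sampled with replacement.
        for _ in range(n_models)
    ]


def majority_vote(models, x):
    """Return the label predicted by most models for input x."""
    votes = Counter(model(x) for model in models)
    return sorted(votes.items(), key=lambda kv: (-kv[1], str(kv[0])))[0][0]
```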

Cost-sensitive learning

In cost-sensitive learning, misclassification costs are incorporated into the decision process. The costs are usually set inversely proportional to the class frequencies. It helps to focus more on minority classes.

In many real-world scenarios, the cost of misclassifying one class of data may be much higher than misclassifying another. Cost-sensitive learning involves modifying the learning algorithm so that it minimizes a weighted sum of errors associated with each class [ref 10].
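
One simple way to bring costs into the decision process is to predict the class with the lowest expected misclassification cost instead of the most probable class. The sketch below shows this decision rule only (it does not modify the training loss); the cost values and the function name are hypothetical.

```python
def min_cost_class(probs, cost):
    """Pick the prediction that minimizes the expected misclassification cost.

    probs maps each class to its predicted probability;
    cost[true][predicted] is the cost of predicting `predicted` for `true`.
    """
    expected = {
        predicted: sum(probs[true] * cost[true][predicted] for true in probs)
        for predicted in probs
    }
    return min(expected, key=lambda c: (expected[c], str(c)))
```

For example, if missing a disease costs ten times more than a false alarm, a patient with only a 20% probability of disease is still flagged, because the expected cost of predicting "healthy" (0.2 x 10) exceeds that of predicting "disease" (0.8 x 1).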

One-vs-all (OvA) strategy

The one-vs-all strategy is a technique used in machine learning for multi-class classification problems dealing with datasets where one class is significantly underrepresented compared to others.

The concept consists of breaking down a multi-class classification problem into multiple binary classification problems. For each class in the dataset, a separate binary classifier is trained. This classifier distinguishes between the class under consideration (positive case) and all other classes (negative case).

By treating the minority class as the 'positive' case in one of the binary classifiers, the techniques described in previous sections, such as class weighting or oversampling, can be applied more effectively [ref 11].

Transfer learning

Transfer learning is a technique where a model developed for one task is reused as the starting point for a model on a second task. Transfer learning typically involves using a model pre-trained on a large and diverse dataset [ref 12].

This is a sequential process used when:

The model has originally been trained on a large data set and needs to be updated with a new, smaller data set.
The training tasks use the same medium (text, image, audio …).
Low level features from the original model can be reused for the new model.

These models have learned rich feature representations that can be beneficial for a wide range of tasks, including those with class imbalance. Since the pre-trained model has already learned a considerable amount of information, the need for a large and balanced dataset for the new task is reduced. This is particularly helpful when the available dataset for the new task is imbalanced.

Data privacy

Data privacy may not be at the forefront of data scientists' concerns during the training and tuning of models. However, these issues become crucial for productization. For instance, HIPAA requires medical records to be fully and properly de-identified; any violation incurs significant financial penalties.

Data observability

Data observability is defined by 5 key attributes [ref 13]:
Freshness: How up to date is the data (latency, refresh rate, …)?
Quality: Does the distribution at inference differ from training? Are there missing data or duplicates?
Volume: Has the traffic pattern or rate changed?
Schema: Has the data organization changed? Is there a new source of data?
Lineage: Metadata on upstream and downstream sources.

The 3 pillars of observability data:
Metrics: Numeric representations of data measured over time.
Logs: Records of events with their associated context.
Traces: Causally related events in a distributed environment.

Data downtime = Number of incidents * (Time to detection + Time to resolution)
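
As a quick worked example of this formula, with hypothetical figures and times expressed in hours:

```python
def data_downtime(incidents, time_to_detection, time_to_resolution):
    """Data downtime = number of incidents * (detection + resolution time)."""
    return incidents * (time_to_detection + time_to_resolution)


# 4 incidents, each taking 2 hours to detect and 3.5 hours to resolve.
downtime_hours = data_downtime(4, 2.0, 3.5)  # 22.0 hours
```

The formula treats the detection and resolution times as averages per incident, so shortening either one reduces downtime proportionally.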

Automated data pipeline

Finally, the variety and breadth of techniques used to address data quality issues make these solutions difficult to deploy and orchestrate. Quite often, these techniques are evaluated and deployed in an ad-hoc fashion.

One effective solution is to create a configurable streaming pipeline of various data manipulation and correction techniques that execute concurrently.

Open-source frameworks such as Apache Spark and Kafka can be a starting point for building a data-centric AI platform for test and production data.