Wednesday, February 15, 2023

Improve Python with Pydantic 2.x

Target audience: Advanced
Estimated reading time: 3'

In my tenure as a Scala developer, I frequently underscored the importance of robust type checking mechanisms. When transitioning to Python, it was the incorporation of the Pydantic library and the native typing package that significantly underscored Python's capabilities as a mature programming language, bolstering its appeal to developers like me.

This article delves into the contemporary features of the Pydantic library, elucidating concepts such as serialization, annotated validators, and discriminated unions.


Table of contents
Follow me on LinkedIn
Notes:
  • Environments: Python 3.10, Pydantic 2.4.2
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Why pydantic?

Pydantic [ref 1] is a Python library designed for data modeling and parsing, equipping Python developers with customizable tools for error handling and validation. This library seamlessly integrates built-in features for JSON encoding and decoding, combined with validation. It offers a declarative approach to data representation.

Mirroring the typing module, Pydantic aims to emulate the type checking seen in static languages like Java and C++ through IDE-enforced hints. Moreover, it enhances the Python dataclasses module by integrating agile error handling and data validation capabilities.

Pydantic has several benefits:
  • Dedicated, specific types such as FilePath or HttpUrl for validation.
  • Custom validators.
  • Options (Optional type) similar to Java and Scala.
  • Integration with lint functionality in most common IDE.
  • Declarative syntax for modeling data.
  • Wrapper (Field type) for data validation.

To install the Pydantic library (MacOS)
       pip install pydantic

Notes: 
  • This is a sample of features available in version 2.2.1 [ref 2]. The reader is invited to explore the entire functionality of pedantic.
  • A migration from Pydantic version 1.x is available to Python developers [ref 3].
  • FastAPI web framework leverages Pydantic library [ref 4].

Serialization validators

One basic function is validation of JSON representation of an object. The following example combines Pydantic data class Algorithm as inherited from BaseModel and data type from typing module.
An algorithm instance if completely defined by its name (i.e., Binary tree), time complexity (i.e., O(logN), asymptotic search time complexity [ref 5] and supported programming languages (i.e., C++, Python, Go, ...). For sake of clarity, time and space complexity is defined as an enumerator.

from pydantic import BaseModel, ValidationError
from typing import AnyStr, Tuple, Dict, Optional
from enum import Enum

""
    # Time complexity for average case for search operation
class AlgorithmComplexity(Enum): O1 = 0 # Hash table OlogN = 1 # Binary search, red-black tree, ... ON = 2 # Linked list, stack, queue, ... ONlogN = 3 # Quick, merge, shell sort, ... ON2 = 4 # Insertion sort, ... class Algorithm(BaseModel): name: AnyStr time_complexity: AlgorithmComplexity space_complexity: AlgorithmComplexity languages: Tuple
# Tuple of supported programming languages


In the following code snippet, the Pydantic constructor Algorithm.__init__ validates the textual/JSON representation of the data which fails for the improperly definition of the linked list.

if __name__ == '__main__':

       # Correct definition of a binary tree
binary_tree_json = { 'name': 'Binary Tree', 'time_complexity': AlgorithmComplexity.OlogN, 'space_complexity': AlgorithmComplexity.OlogN, 'languages': ('java', 'C++', 'Python') } try: binary_tree = Algorithm(**binary_tree_json) except ValidationError as err: logging.error(err.errors())

      # Incorrect specification of the time complexity to access a linked list 
linked_list_json = { 'name': 'LinkedList', 'time_complexity': 'N', 'space_complexity': AlgorithmComplexity.ON, 'languages': ('java', 'C++', 'Scala') } try: linked_list = Algorithm(**linked_list_json) except ValidationError as err: logging.error(err.errors()) # [{'type': 'enum', 'loc': ('time_complexity',), 'msg': 'Input should be 0,1,2,3 or 4', # 'input': 'N2', 'ctx': {'expected': '0,1,2,3 or 4'}}]



Annotated validator

Pydantic provides a way to apply validators on a field value or types . You should use this whenever you want to bind validation to a type instead of model or field.

There are several types of validators:
  • After validator which is run after Pydantic internal parser and validator.
  • Before validator, run before parsing and validation
  • Plain validator which is a variant of before validator terminating the validation once it fails.
  • Wrap validator which supports the functionality of before, after and plain validators.
The following example implements a 'before validator': the languages listed for a given algorithm should be valid as specified in the tuple, valid_languages.

Here are the 2 steps:
  1. Define a validation function, validate_languages. Note that the type of the output, Tuple is the same as the type of input (argument)
  2. Convert the variable, languages as an annotation which takes the input/output type, Tuple as first argument and the validation function as second argument.

from typing_extensions import Annotated
from pydantic.functional_validators import BeforeValidator



# Formal list of programming languages used to implement the algorithms
valid_languages: Tuple = ('python', 'go', 'scala', 'c++', 'java')

"""
    Remove the languages which are not correctly specified (typos, non-compliant, ..)
"""
def validate_languages(languages: Tuple) -> Tuple:
    return (lang for lang in languages if lang.lower() in valid_languages)


        # We add the annotator for the programming language field.
class Algorithm(BaseModel):
    name: AnyStr
    time_complexity: AlgorithmComplexity   
    space_complexity: AlgorithmComplexity
                         # is the list of supported languages legitimate?
    languages: Annotated[Tuple, BeforeValidator(validate_languages)]



The simplest validation scheme is to make sure the programming languages available for this algorithm are valid. We do not make any assumption whether programming languages are specified with lower or upper case characters.

  
binary_tree_json = {
     'name': 'Binary Tree',
     'time_complexity': AlgorithmComplexity.OlogN,
     'space_complexity': AlgorithmComplexity.OlogN,
     'languages': ('Java', 'C+++', 'Python')    # C+++ is not a valid programming language
 }
    
 try:
     binary_tree = Algorithm(**binary_tree_json)
     logging.info(str(binary_tree.languages))   # ('Java', 'Python')
 except ValidationError as err:
     logging.error(err.errors())



Discriminated unions for composition

The goal of discriminated unions is to resolve the need of multi-inheritance by creating a composite of mutual exclusive types.
Let's consider a simple hierarchy of HTML elements:


Class diagram in sub-classing using union discriminated on html_type

The HTMLText class for the HTML fragment <p>[text]</p> and HTMLImage class for the HTML component <img src=[url] /> inherit from the base abstract class HTMLElement. The member html_type declared with type Literal defines explicitly the type of element.
The goal is to create a new composite type, WebElement, that can invoke either HTMLText or (exclusive) HTMLImage class.

import abc
from typing import AnyStr, Union, Literal
from pydantic import BaseModel, Field


class HTMLElement(BaseModel):
   html_type: AnyStr

   @abc.abstractmethod
   def get(self) -> AnyStr:
      pass


class HTMLText(HTMLElement):
   html_type: Literal["p"]
   doc: AnyStr
    
   def get(self) -> AnyStr:
      return f'<{self.html_type}>{self.doc}<{self.html_type}/>'


class HTMLImage(HTMLElement):
   html_type: Literal["img"]
   img_url: AnyStr
    
   def get(self) -> AnyStr:
       return f'<{self.html_type} src={self.img_url} />'



The objective is to create a mutually exclusive composite, WebElement of HTMLImage and HTMLText using html_type as discriminator.
The method polymorphic method get illustrates the mechanism by generating the actual textual representation of the HTML components.

class WebElement(BaseModel):
  element: Union[HTMLImage, HTMLText] = Field(discriminator="html_type")

  def get(self) -> AnyStr:
        return self.element.get()


The simple test consists of creating a dictionary, test_element to initialize a WebElement instance with type 'img'.

if __name__ == '__main__':
  html_text = HTMLText(html_type='p', doc='This is correct')
  print(html_text.get())  # <p>This is correct<p/>
    
  test_element: Dict = {"img_url": "python.png", "text": "", "html_type": "img"}
  web_element = WebElement(element=test_elementl)
  print(web_element.get()) # <img src=python.png />


Note: For readers familiar with Scala, discriminated unions are somewhat similar to mixins.

Thank you for reading this article. For more information ...

References

[4FastAPI


---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3