Sunday, April 7, 2024

Functional Data Analysis in Python

Target audience: Advanced
Estimated reading time: 7'

In the realms of healthcare and IT monitoring, I encountered the challenge of managing multiple data points across various variables, features, or observations. Functional data analysis (FDA) is well-suited for addressing this issue. 
This article explores how the Hilbert sphere can be used to conduct FDA in non-linear spaces.

Table of contents
        FDA methods
        Formal notation
        Hilbert sphere
        Implementation
        Manifold structure
        Inner product
        Exponential map
        Logarithm map
References

What you will learn: Basic concepts of functional data analysis in non-linear spaces through the use of manifolds, along with a hands-on application of the Hilbert sphere using Geomstats in Python.

Notes
  • Environments: Python 3.10.10, Geomstats 2.7.0
  • This article assumes that the reader is somewhat familiar with differential and tensor calculus [ref 1]. Please refer to the previous articles related to geometric learning [ref 2, 3].
  • Source code is available at Github.com/patnicolas/Data_Exploration/manifolds
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Introduction

This article provides a summary of functional data analysis, then introduces and implements a technique specific to non-linear manifolds: the Hilbert sphere.

This article is the 6th installment in our series on Geometric Learning in Python.

Functional data analysis

Functional data analysis (FDA) is a statistical approach designed for analyzing curves, images, or functions that exist within higher-dimensional spaces [ref 4].

Observation data types

Panel Data:
In fields like health sciences, data collected through repeated observations over time on the same individuals is typically known as panel data or longitudinal data. Such data often includes only a limited number of repeated measurements for each unit or subject, with varying time points across different subjects.

Time Series:
This type of data comprises single observations made at regular time intervals, such as those seen in financial markets.

Functional Data:
Functional data involves diverse measurement points across different observations (or subjects). Typically, this data is recorded over consistent time intervals and frequencies, featuring a high number of measurements per observational unit or subject.

FDA methods

In Functional Data Analysis (FDA), the primary subjects of study are random functions, which are elements in a function space representing curves, trajectories, or surfaces. The statistical modeling and inference occur within this function space. Due to its infinite dimensionality, the function space requires a metric structure, typically a Hilbert structure, to define its geometry and facilitate analysis.

When the function space is properly established, a data scientist can perform various analytical tasks, including:
  • Computing statistics such as mean, covariance, and mode
  • Conducting classification and regression analyses
  • Performing hypothesis testing with methods like T-tests and ANOVA
  • Executing clustering
  • Carrying out inference
The following diagram illustrates a set of random functions around a smooth function \(\tilde{X}\) over the interval [0, 1]: \[\tilde{X}(t)=3e^{-t^{2}}\sin(25t)-2t^{2}+5\ \ \ \ t\in [0, 1]\]

Fig. 1 Visualization of random functions on a Hilbert space
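A figure like Fig. 1 can be reproduced with a short simulation; the following is a minimal sketch assuming Gaussian noise around the smooth function (the noise level and number of curves are illustrative choices, not taken from the original plot):

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0.0, 1.0, 256)
x_tilde = 3 * np.exp(-t ** 2) * np.sin(25 * t) - 2 * t ** 2 + 5

# Simulate 8 random functions as noisy perturbations of the smooth function
rng = np.random.default_rng(42)
for _ in range(8):
    plt.plot(t, x_tilde + rng.normal(scale=0.4, size=t.shape), lw=0.6)

plt.plot(t, x_tilde, color='black', lw=2.0, label='Smooth function')
plt.legend()
plt.show()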

Methods in FDA are classified based on the type of manifold (linear or nonlinear) and the dimensionality or feature count of the space (finite or infinite). The categorization and examples of FDA techniques are demonstrated in the table below.

Dimension      Linear manifold           Non-linear manifold
Finite         Euclidean R^n             Special orthogonal SO(3)
Infinite       Square integrable L^2     Hilbert sphere

          Table 1: Categorization of FDA techniques

This article focuses on the Hilbert sphere, a specific function space equipped with a Riemannian metric (inner product).

Formal notation

Let's consider a sample of n random functions Xi observed as \[x_{i}(t)=X_{i}(t) \in \mathbb{R}\ \ \ \ i=1,\dots ,n\ \ \ \ t\in T \subset \mathbb{R}\]
The function space is a manifold of square integrable functions defined as \[\textit{L}^{2}(T)=\left \{ f: T\rightarrow \mathbb{R}\ \middle| \ \int_{T} f(t)^{2}\,dt < \infty \right \}\] The Riemannian metric on tangent vectors f, g is induced by, and equal to, the L2 inner product: \[\left \langle f, g \right \rangle = \int_{T} f(t)\,g(t)\,dt\ \ \ \left \| f \right \| _{\mathit{L}^{2}}=\sqrt{\left \langle f, f \right \rangle} \ \ \ \  (1)\]
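In practice, functions are observed at discrete samples of T, so the integral in (1) is approximated numerically. Here is a minimal sketch, assuming evenly spaced samples of [0, 1] and trapezoidal integration (NumPy's trapz):

import numpy as np

def l2_inner_product(f: np.ndarray, g: np.ndarray, t: np.ndarray) -> float:
    # Discrete approximation of <f, g> = integral of f(t)g(t) dt -- formula (1)
    return float(np.trapz(f * g, x=t))

t = np.linspace(0.0, 1.0, 100)
f, g = np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)

print(l2_inner_product(f, g, t))           # ~0.0: sine and cosine are L2-orthogonal
print(np.sqrt(l2_inner_product(f, f, t)))  # ~0.707: L2 norm, sqrt(1/2)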

Hilbert sphere

A Hilbert space is a vector space equipped with an inner product whose induced distance makes it a complete metric space. In the context of functional data analysis, attention is primarily given to functions that are square-integrable [ref 5].

Hilbert spaces have numerous important applications:
  • Probability theory: the space of random variables with finite variance, centered by their expectation
  • Quantum mechanics: the state space of a quantum system
  • Differential equations: solution spaces for partial differential equations
  • Biological structures: protein structures, folds, ...
  • Medical imaging: MRI, CT scans, ...
  • Meteorology

The infinite-dimensional Hilbert sphere S has been used extensively for modeling density functions and shapes, more so than its finite-dimensional counterpart. This spherical Hilbert geometry supports invariant properties and allows for the efficient computation of geometric measures.

The Hilbert sphere is a particular case of function space defined as: \[H(T)=\left \{ f: T\rightarrow \mathbb{R}\ \middle|\ \left \| f \right \|_{L^{2}}= 1 \right \}\] The Riemannian exponential map at p, from the tangent space to the Hilbert sphere, preserves the distance to the origin and is defined as: \[exp_{p}(f)=\cos\left ( \left \| f \right \|_{E} \right )p+\sin\left ( \left \| f \right \|_{E} \right)\frac{f}{\left \| f \right \|_{E}} \ \ \ \ (2) \] where ||f||E is the norm of f in the embedding Euclidean space.
The logarithm (or inverse exponential) map at a point p is defined as \[log_{p}(f)=\arccos\left (\left \langle p, f \right \rangle_{p} \right )\frac{f}{\left \| f \right \|} \ \ \ \  (3) \]
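Formulas (2) and (3) translate directly into NumPy. The sketch below is a plain-array illustration, not the Geomstats implementation used later; the log uses the tangential projection of the target point at p, which is the usual numerical form of (3):

import numpy as np

def sphere_exp(p: np.ndarray, f: np.ndarray) -> np.ndarray:
    # Exponential map (2): cos(|f|) p + sin(|f|) f/|f|
    norm = np.linalg.norm(f)
    return np.cos(norm) * p + np.sin(norm) * f / norm

def sphere_log(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    # Logarithm map (3): rescale the tangential component of q at p
    # to the geodesic distance arccos(<p, q>)
    proj = q - np.dot(p, q) * p
    return np.arccos(np.dot(p, q)) * proj / np.linalg.norm(proj)

rng = np.random.default_rng(0)
p = rng.normal(size=8)
p /= np.linalg.norm(p)                   # random point on the sphere
v = rng.normal(size=8)
v = 0.5 * (v - np.dot(v, p) * p)         # small tangent vector at p
q = sphere_exp(p, v)
print(np.allclose(sphere_log(p, q), v))  # True: log inverts exp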

Implementation

We will illustrate the operations on the Hilbert sphere using concepts introduced in a previous article, Geometric Learning in Python: Manifolds.
We leverage the class ManifoldPoint, introduced in a previous post and used across our series on geometric learning.
As a reminder:

@dataclass
class ManifoldPoint:
    id: AnyStr
    location: np.array
    tgt_vector: List[float] = None
    geodesic: bool = False
    intrinsic: bool = False

Manifold structure

Let's develop a wrapper class named FunctionSpace to facilitate the creation of points on the Hilbert sphere and to carry out the calculation of the inner product, as well as the exponential and logarithm maps related to the tangent space. 

Our implementation relies on the Geomstats library [ref 6], introduced in Geometric Learning in Python: Manifolds.

The function space will be constructed using num_domain_samples, which are evenly spaced real values within the interval [0, 1]. Points on the manifold can be generated either with the Geomstats method HilbertSphere.random_point or by specifying a base point, base_point, and a directional vector.

from geomstats.geometry.functions import HilbertSphere, HilbertSphereMetric


class FunctionSpace(HilbertSphere):
    def __init__(self, num_domain_samples: int):
        domain_samples = gs.linspace(0, 1, num=num_domain_samples)
        super(FunctionSpace, self).__init__(domain_samples, True)

    def create_manifold_point(self, id: AnyStr, vector: np.array, base_point: np.array) -> ManifoldPoint:
        # Compute the tangent vector from the direction 'vector' at the point 'base_point'
        tgt_vector = self.to_tangent(vector, base_point)
        return ManifoldPoint(id, base_point, tgt_vector)

    def random_manifold_points(self, n_samples: int) -> List[ManifoldPoint]:
        return [ManifoldPoint(
                    id=f'rand_{n+1}',
                    location=random_pt)
                for n, random_pt in enumerate(self.random_point(n_samples))]

Let's generate a point on the Hilbert sphere using a random base point on the manifold and a 4-dimensional vector.

num_samples = 4
function_space = FunctionSpace(num_samples)
random_base_pt = function_space.random_point()

vector = np.array([1.0, 0.5, 1.0, 0.0])
manifold_pt = function_space.create_manifold_point('id', vector, random_base_pt)

Output:
Manifold point: 
    Base point=[[0.13347 0.85738 1.48770 0.29235]], 
    Tangent Vector=[[ 0.91176 -0.0667 0.01656 -0.19326]],
    No Geodesic, 
    Extrinsic
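As a sanity check, the base point returned by random_point should lie on the Hilbert sphere, i.e. have unit L2 norm. Assuming the metric integrates over 4 evenly spaced samples of [0, 1] with trapezoidal quadrature (an assumption consistent with the outputs in this article), the check below recovers 1.0:

import numpy as np

base_pt = np.array([0.13347, 0.85738, 1.48770, 0.29235])
t = np.linspace(0.0, 1.0, 4)

# L2 norm of the sampled function, approximated by trapezoidal integration
print(np.sqrt(np.trapz(base_pt ** 2, x=t)))   # ~1.0: the point lies on the sphere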

Inner product

Let's wrap formula (1) into a method. We introduce the inner_product method to the FunctionSpace class, which encapsulates the call to self.metric.inner_product of the Geomstats HilbertSphere metric.

This method requires two parameters:
  • vector_1: The first vector used in the computation of the inner product
  • vector_2: The second vector used in the computation of the inner product
The second method, manifold_point_inner_product, adds the base point on the manifold without any assumption of parallel transport. The base point is the origin of both the tangent vector associated with the base point, manifold_base_pt, and the tangent vector associated with the second point, manifold_pt.

def inner_product(self, tgt_vector1: np.array, tgt_vector2: np.array) -> np.array:
    return self.metric.inner_product(tgt_vector1, tgt_vector2)

def manifold_point_inner_product(
        self,
        manifold_base_pt: ManifoldPoint,
        manifold_pt: ManifoldPoint) -> np.array:
    return self.metric.inner_product(
        manifold_base_pt.tgt_vector,
        manifold_pt.tgt_vector,
        manifold_base_pt.location)

Let's calculate the inner product of two specific NumPy vectors in an 8-dimensional space using our class FunctionSpace, along with the Euclidean norm and the Hilbert (tangent-space) norm of the first vector.

num_Hilbert_samples = 8
function_space = FunctionSpace(num_Hilbert_samples)

vector1 = np.array([0.5, 1.0, 0.0, 0.4, 0.7, 0.6, 0.2, 0.9])
vector2 = np.array([0.5, 0.5, 0.2, 0.4, 0.6, 0.6, 0.5, 0.5])
inner_prod = function_space.inner_product(vector1, vector2)
print(f'Inner product of vectors 1 & 2: {str(inner_prod)}')
print(f'Euclidean norm of vector 1: {np.linalg.norm(vector1)}')
print(f'Norm of vector 1: {str(math.sqrt(function_space.inner_product(vector1, vector1)))}')

Output:
Inner product of vectors 1 & 2: 0.2700
Euclidean norm of vector 1: 1.7635
Norm of vector 1: 0.6071
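These numbers can be reproduced without Geomstats. With 8 evenly spaced samples of [0, 1], the integral in formula (1) reduces to a trapezoidal sum; the sketch below is an independent verification, not the library's actual code path:

import numpy as np

t = np.linspace(0.0, 1.0, 8)
vector1 = np.array([0.5, 1.0, 0.0, 0.4, 0.7, 0.6, 0.2, 0.9])
vector2 = np.array([0.5, 0.5, 0.2, 0.4, 0.6, 0.6, 0.5, 0.5])

print(np.trapz(vector1 * vector2, x=t))       # 0.2700: matches the metric inner product
print(np.sqrt(np.trapz(vector1 ** 2, x=t)))   # 0.6071: matches the Hilbert norm of vector 1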

Exponential map

Let's wrap formula (2) into a method. We introduce the exp method to the FunctionSpace class, which encapsulates the call to self.metric.exp of the Geomstats HilbertSphere metric.

This method requires two parameters:
  • vector: The directional vector used in the computation of the exponential map
  • manifold_base_pt: The base point on the manifold

def exp(self, vector: np.array, manifold_base_pt: ManifoldPoint) -> np.array:
    return self.metric.exp(tangent_vec=vector, base_point=manifold_base_pt.location)

Let's compute the exponential map at a random base point on the manifold for an 8-dimensional NumPy vector, using the class FunctionSpace.

num_Hilbert_samples = 8
function_space = FunctionSpace(num_Hilbert_samples)

vector = np.array([0.5, 1.0, 0.0, 0.4, 0.7, 0.6, 0.2, 0.9])
assert num_Hilbert_samples == len(vector)
        
exp_map_pt = function_space.exp(vector, function_space.random_manifold_points(1)[0])
print(f'Exponential on Hilbert Sphere:\n{str(exp_map_pt)}')

Output:
Exponential on Hilbert Sphere: 
[0.97514 1.6356 0.15326 0.59434 1.06426 0.74871 0.24672 0.95872]
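Note that exp assumes its argument already lies in the tangent space at the base point; the raw vector above was not projected, so the image is not guaranteed to sit exactly on the sphere. Here is a hedged sketch that projects first with to_tangent and then verifies the unit L2 norm, assuming as before a trapezoidal quadrature on [0, 1]:

num_Hilbert_samples = 8
function_space = FunctionSpace(num_Hilbert_samples)

base_pt = function_space.random_manifold_points(1)[0]
vector = np.array([0.5, 1.0, 0.0, 0.4, 0.7, 0.6, 0.2, 0.9])

# Project the raw vector onto the tangent space at the base point
tgt_vector = function_space.to_tangent(vector, base_pt.location)
end_pt = function_space.exp(tgt_vector, base_pt)

# The image of a tangent vector under exp should lie on the Hilbert sphere
t = np.linspace(0.0, 1.0, num_Hilbert_samples)
print(np.sqrt(np.trapz(end_pt ** 2, x=t)))   # ~1.0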

Logarithm map

Let's wrap formula (3) into a method. We introduce the log method to the FunctionSpace class, which encapsulates the call to self.metric.log of the Geomstats HilbertSphere metric.

This method requires two parameters:
  • manifold_base_pt: The base point on the manifold.
  • target_pt: Another point on the manifold, used to produce the log map.

def log(self, manifold_base_pt: ManifoldPoint, target_pt: ManifoldPoint) -> np.array:
    return self.metric.log(point=target_pt.location, base_point=manifold_base_pt.location)

Let's compute the logarithm map at a random base point on the manifold toward a second random point, using the class FunctionSpace.

num_Hilbert_samples = 8
function_space = FunctionSpace(num_Hilbert_samples)

random_points = function_space.random_manifold_points(2)
log_map_pt = function_space.log(random_points[0], random_points[1])
print(f'Logarithm from Hilbert Sphere {str(log_map_pt)}')

Output:
Logarithm from Hilbert Sphere 
[1.39182 -0.08986 0.32836 -0.24003 0.30639 -0.28862 -0.431680 4.15148]
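As a consistency check, the logarithm map inverts the exponential map: mapping the log of one random point back through exp at the same base point should recover the original point. A sketch using the FunctionSpace methods defined above:

num_Hilbert_samples = 8
function_space = FunctionSpace(num_Hilbert_samples)

base_pt, target_pt = function_space.random_manifold_points(2)

# log produces a tangent vector at the base point ...
tgt_vec = function_space.log(base_pt, target_pt)
# ... and exp maps it back onto the manifold, recovering the target point
recovered = function_space.exp(tgt_vec, base_pt)
print(np.allclose(recovered, target_pt.location))   # Should print: True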


References




-------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3 
and Geometric Learning in Python Newsletter on LinkedIn.









Wednesday, April 3, 2024

Geometric Learning in Python: Vector Operators

Target audience: Beginner
Estimated reading time: 5'

Physics-Informed Neural Networks (PINNs) are gaining popularity for integrating physical laws into deep learning models. Essential tools like vector operators, including the gradient, divergence, curl, and Laplacian, are crucial in applying these constraints.


Table of contents
        Gradient
        Divergence
        Curl
        Laplacian
        Appendix


What you will learn: How to implement the vector gradient, divergence, curl and laplacian operators in Python using the SymPy library.

Notes
  • Environments: Python 3.10.10, SymPy 1.12, Matplotlib 3.8.2
  • This article assumes that the reader is familiar with differential and tensor calculus.
  • Source code is available at github.com/patnicolas/Data_Exploration/diffgeometry
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Introduction

Geometric learning addresses the difficulties of limited data, high-dimensional spaces, and the need for independent representations in the development of sophisticated machine learning models.

Note: This article is the 5th installment in our series on Geometric Learning in Python.
The following description of vector differential operators leverages some of the concepts defined in the previous articles of the Geometric Learning in Python series, as well as the SymPy library.

SymPy is a Python library dedicated to symbolic mathematics. Its implementation is kept as simple as possible in order to be comprehensible and easily extensible, with support for differential and integral calculus, matrix operations, algebraic and polynomial equations, differential geometry, probability distributions, and 3D plotting. The source code is available at github.com/sympy/sympy.git

In tensor calculus, a vector operator is a type of differential operator:
  • The gradient transforms a scalar field into a vector field.
  • The divergence changes a vector field into a scalar field.
  • The curl converts a vector field into another vector field.
  • The laplacian takes a scalar field and yields another scalar field.

Gradient

Consider a scalar field f in 3-dimensional space. The gradient of this field is defined as the vector of the 3 partial derivatives with respect to x, y and z [ref 1].\[\triangledown f= \frac{\partial f}{\partial x} \vec{i} + \frac{\partial f}{\partial y} \vec{j} + \frac{\partial f}{\partial z} \vec{k}\]
We create a class, VectorOperators, that wraps the following operators: gradient, divergence, curl and laplacian. The constructor initializes the expression for the input function to which the various operators are applied.

class VectorOperators(object):
    def __init__(self, expr: Expr):     # Expression for the input function
        self.expr = expr

    def gradient(self) -> VectorZero:
        from sympy.vector import gradient

        return gradient(self.expr, doit=True)

Let's calculate the gradient vector for the function \[f(x,y,z)=x^{2}+y^{2}+z^{2} \] , as depicted in the plot below.

Fig. 1  3D visualization of function f(x,y,z) = x**2 + y**2 + z**2

The Python code for the visualization of the function is described in the Appendix (Visualization function).

The following code snippet computes the gradient of this function as \[\triangledown f(x,y,z) = 2x.\vec{i} + 2y.\vec{j} + 2z.\vec{k}\] 
r = CoordSys3D('r')
f = r.x*r.x + r.y*r.y + r.z*r.z

vector_operator = VectorOperators(f)
grad_f = vector_operator.gradient()    # 2*r.x*r.i + 2*r.y*r.j + 2*r.z*r.k

The function f is defined using the default Euclidean coordinate system r. The gradient is depicted in the following plot, implemented using the Matplotlib module. The actual implementation is described in the Appendix (Visualization gradient).

Fig. 2  3D visualization of grad_f(x,y,z) = 2x.i + 2y.j + 2z.k


Divergence

Divergence is a vector operator used to quantify the strength of a vector field's source or sink at a specific point, producing a signed scalar value. When applied to a vector field F with components X, Y, and Z, the divergence operator consistently yields a scalar result [ref 2]: \[div(F)=\triangledown .F=\frac{\partial X}{\partial x}+\frac{\partial Y}{\partial y}+\frac{\partial Z}{\partial z}\]
Let's implement the computation of the divergence as method of the class VectorOperators:

def divergence(self, base_vec: Expr) -> VectorZero:
    from sympy.vector import divergence

    div_vec = self.expr*base_vec
    return divergence(div_vec, doit=True)

Using the same instance vector_operator as with the gradient calculation:
div_f = vector_operator.divergence(r.i + r.j + r.k)

The execution of the code above produces the following: \[div(f(x,y,z)[\vec{i} + \vec{j}+ \vec{k}]) = 2(x+y+z)\]


Curl

In mathematics, the curl operator represents the minute rotational movement of a vector field in three-dimensional space. The rotation's direction follows the right-hand rule (aligned with the axis of rotation), while its magnitude is defined by the extent of the rotation [ref 2]. Within a 3D Cartesian system, for a three-dimensional vector field F, the curl operator is defined as follows: \[ \triangledown \times \mathbf{F}=\left (\frac{\partial F_{z}}{\partial y}- \frac{\partial F_{y}}{\partial z} \right ).\vec{i} + \left (\frac{\partial F_{x}}{\partial z}- \frac{\partial F_{z}}{\partial x} \right ).\vec{j} + \left (\frac{\partial F_{y}}{\partial x}- \frac{\partial F_{x}}{\partial y} \right ).\vec{k} \] Let's add the curl method to our class VectorOperators as follows:

def curl(self, base_vectors: Expr) -> VectorZero:
    from sympy.vector import curl

    curl_vec = self.expr*base_vectors
    return curl(curl_vec, doit=True)

For the sake of simplicity, let's compute the curl along the two base vectors j and k:
curl_f = vector_operator.curl(r.j + r.k)    # j and k directions only

The execution of the curl method outputs: \[curl(f(x,y,z)[\vec{j} + \vec{k}]) = 2\left ( y-z \right).\vec{i}-2x.\vec{j}+2x.\vec{k}\]
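This result can be cross-checked against the component-wise definition of the curl using plain sympy.diff (a verification sketch, independent of the sympy.vector module):

from sympy import symbols, diff

x, y, z = symbols('x y z')
f = x**2 + y**2 + z**2

# F = f.j + f.k, so Fx = 0, Fy = f, Fz = f
Fx, Fy, Fz = 0, f, f
print(diff(Fz, y) - diff(Fy, z))   # 2*y - 2*z
print(diff(Fx, z) - diff(Fz, x))   # -2*x
print(diff(Fy, x) - diff(Fx, y))   # 2*x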


Laplacian

In mathematics, the Laplacian, or Laplace operator, is a differential operator derived from the divergence of the gradient of a scalar function in Euclidean space [ref 3]. The Laplacian is utilized in various fields, including calculating gravitational potentials, solving heat and wave equations, and processing images.
It is a second-order differential operator in n-dimensional Euclidean space, defined as follows: \[\triangle f= \triangledown ^{2}f=\sum_{i=1}^{n}\frac{\partial^2 f}{\partial x_{i}^2}\] The implementation of the laplacian method reflects the fact that the operator is the divergence (step 2) of the gradient (step 1).

def laplacian(self) -> VectorZero:
    from sympy.vector import divergence

    # Step 1: Compute the gradient vector
    grad_f = self.gradient()

    # Step 2: Apply the divergence to the gradient
    return divergence(grad_f)

Once again, we leverage the 3D coordinate system defined in SymPy to specify the two functions for which the laplacian has to be evaluated.

r = CoordSys3D('r')
    
f = r.x*r.x + r.y*r.y + r.z*r.z
vector_operators = VectorOperators(f)
laplace_op = vector_operators.laplacian()
print(laplace_op)       # 6

f = r.x*r.x*r.y*r.y*r.z*r.z  
vector_operators = VectorOperators(f)
laplace_op = vector_operators.laplacian()
print(laplace_op)     # 2*r.x**2*r.y**2 + 2*r.x**2*r.z**2 + 2*r.y**2*r.z**2

Outputs: \[\triangle (x^{2} + y^{2}+z^{2}) = 6\] \[\triangle (x^{2}y^{2}z^{2})=2(x^{2}y^2 +x^{2}z^{2}+y^{2}z^{2}) \]
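The outputs can be double-checked by summing second partial derivatives directly, per the definition above (a verification sketch using sympy.diff rather than the vector module):

from sympy import symbols, diff, factor

x, y, z = symbols('x y z')

for f in (x**2 + y**2 + z**2, x**2 * y**2 * z**2):
    laplacian = diff(f, x, 2) + diff(f, y, 2) + diff(f, z, 2)
    print(factor(laplacian))   # 6, then 2*(x**2*y**2 + x**2*z**2 + y**2*z**2)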


-------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3 
and Geometric Learning in Python Newsletter on LinkedIn.

Appendix

Visualization function

The implementation relies on:
  • NumPy meshgrid to set up the axes and values for the x, y and z axes
  • The Matplotlib scatter method to display the values generated by the function f
The function to plot takes the x, y and z grid values and generates the data f(x,y,z).

def show_3D_function(self, f: Callable[[float, float, float], float], grid_values: np.array) -> NoReturn:
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d

    # Set up the grid with the appropriate units and boundary
    x, y, z = np.meshgrid(grid_values, grid_values, grid_values)

    # Apply the function f
    data = f(x, y, z)

    # Set up the plot (labels, legend, ...)
    ax: Axes = self.__setup_3D_plot('3D Plot f(x,y,z) = x^2 + y^2 + z^2')

    # Display the data along x, y and z using a scatter plot
    ax.scatter(x, y, z, c=data)
    plt.show()



Visualization gradient

The 3 components of the gradient vector are passed as the argument grad_f.
The implementation relies on:
  • NumPy meshgrid to set up the axes and values for the x, y and z axes
  • The Matplotlib quiver method to display the gradient vectors at each grid value
def show_3D_gradient(self, grad_f: List[Callable[[float], float]], grid_values: np.array) -> NoReturn:
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d

    # Set up the grid with the appropriate units and boundary
    x, y, z = np.meshgrid(grid_values, grid_values, grid_values)
    ax = self.__setup_3D_plot('3D Plot Gradient 2x.i + 2y.j + 2z.k')

    # Extract the gradient components df/dx, df/dy and df/dz
    X = grad_f[0](x)
    Y = grad_f[1](y)
    Z = grad_f[2](z)

    # Display the gradient vectors as a vector field
    ax.quiver(x, y, z, X, Y, Z, length=1.5, color='grey', normalize=True)
    plt.show()