
Saturday, June 29, 2024

Fisher Information Matrix

Target audience: Advanced
Estimated reading time: 7'
The Fisher Information Matrix plays a crucial role in various aspects of machine learning and statistics. Its primary significance lies in providing a measure of the amount of information that an observable random variable carries about an unknown parameter upon which the probability depends.


Table of contents
       Key elements
       Use cases

What you will learn: How to estimate and visualize the Fisher information matrix for Normal and Beta distributions on a hypersphere.

Notes

  • Environments: Python  3.10.10, Geomstats 2.7.0
  • This article assumes that the reader is somewhat familiar with differential and tensor calculus [ref 1]. Please refer to our previous articles related to geometric learning listed in the Appendix.
  • Source code is available at  Github.com/patnicolas/Data_Exploration/Information Geometry
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Introduction

This article is the 10th installment of our ongoing series on geometric learning. It introduces some basic elements of information geometry as an extension of differential geometry. As in previous articles, we use the Geomstats Python library [ref 2] to implement concepts associated with geometric learning.

Note: Summaries of my earlier articles on this topic can be found in the Appendix.

As a reminder, the primary goal of learning Riemannian geometry is to understand and analyze the properties of curved spaces that cannot be described adequately using Euclidean geometry alone. 

Here is a synopsis of this article:
  1. Brief introduction to information geometry
  2. Overview and mathematical formulation of the Fisher information matrix
  3. Computation of the Fisher metric to Normal and Beta distributions
  4. Implementation in Python using the Geomstats library

Information geometry

Information geometry applies the principles and methods of differential geometry to problems in probability theory and statistics [ref 3]. It studies the manifold of probability distributions and provides a natural framework for understanding and analyzing statistical models.

Key elements

  • Statistical manifolds: Families of probability distributions are considered as a manifold, with each distribution representing a point on this manifold.
  • Riemannian metrics: The Fisher information metric is commonly used to define a Riemannian metric on the statistical manifold. This metric measures the amount of information that an observable random variable carries about an unknown parameter.
  • Divergence measures: Measures such as the Kullback-Leibler (KL) divergence quantify the difference between two probability distributions.
  • Connections and curvature: Differential geometry concepts such as affine connections and curvature are used to describe the geometric properties of statistical models (e.g., the α-connection family).
  • Dualistic inference: Exponential and mixture connections provide a rich structure for statistical inference.

Use cases

Here is a non-exhaustive list of applications of information geometry:
  • Statistical Inference: Parameter estimation, hypothesis testing, and model selection (e.g., Bayesian posterior distributions and the development of efficient sampling algorithms such as Hamiltonian Monte Carlo).
  • Optimization: The natural gradient descent method uses the Fisher information matrix to precondition the gradient, leading to faster convergence than traditional gradient descent (see the sketch after this list).
  • Finance: Modeling uncertainties and analyzing statistical properties of financial models.
  • Machine Learning: Optimization of learning algorithms (e.g., understanding the EM algorithm used in statistical estimation for latent variable models).
  • Neuroscience: Neural coding and information processing in the brain, by modeling neural responses as probability distributions.
  • Robotics: Development of probabilistic robotics, where uncertainty and sensor noise are modeled using probability distributions.
  • Information Theory: Concepts for encoding, compression, and transmission of information.
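
To make the natural gradient idea above concrete, here is a minimal sketch of a single natural-gradient step: the ordinary gradient of the log-likelihood is preconditioned by the inverse of an empirical Fisher information matrix estimated from the score vectors. This is an illustration only, not code from the article's repository; the callable grad_log_lik and the damping term are assumptions.

import numpy as np
from typing import Callable

def natural_gradient_step(theta: np.ndarray,
                          grad_log_lik: Callable,
                          samples: np.ndarray,
                          lr: float = 0.1,
                          damping: float = 1e-3) -> np.ndarray:
    # Per-sample score vectors: d/d(theta) log p(x | theta)
    scores = np.stack([grad_log_lik(x, theta) for x in samples])

    # Empirical Fisher information: average outer product of the score vectors
    fisher = scores.T @ scores / len(samples)

    # Average gradient of the log-likelihood over the samples
    g = scores.mean(axis=0)

    # Ascent step preconditioned by the (damped) inverse Fisher matrix
    return theta + lr * np.linalg.solve(fisher + damping * np.eye(len(theta)), g)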

Fisher information matrix

The Fisher information matrix is a type of Riemannian metric that can be applied to a smooth statistical manifold [ref 4]. It serves to quantify the informational difference between measurements. The points on this manifold represent probability measures defined within a Euclidean probability space, such as the Normal distribution. Mathematically, it is represented by the Hessian of the Kullback-Leibler divergence.

Let's consider a statistical manifold with coordinates (or parameters) θ and its probability density functions over a domain X as follows:\[P= \left \{ p(x, \theta);\ x \in X,\ \int_{X} p(x, \theta)\, dx = 1\right \}\]The Fisher metric is a Riemannian metric tensor defined as the expectation of the second-order partial derivatives of the negative log-likelihood with respect to pairs of coordinates θ:\[g_{ij}(\theta) = -E\left [ \frac{\partial^2 \log p(x,\theta) }{\partial \theta_{i}\,\partial\theta_{j}} \right ] = - \int_{X}{\frac{\partial^2 \log p(x,\theta) }{\partial \theta_{i}\,\partial\theta_{j}}}\, p(x, \theta)\, dx\]

The Fisher information, or Fisher-Rao metric, quantifies the amount of information in the data regarding a parameter θ. As an intrinsic measure, it enables the analysis of a finite, n-dimensional statistical manifold M, with line element:\[ds^{2}=\sum_{i=1}^{n}{\sum_{j=1}^{n}}g_{ij}(\theta)\, d\theta^{i}\, d\theta^{j}\]
The Fisher metric for the Normal distribution θ = {μ, σ} is computed as:\[\mathfrak{I}(\mu, \sigma)=-\textit{E}_{x\sim p}\begin{bmatrix} \frac{\partial ^2 \log p(\theta)}{\partial \mu^2} & \frac{\partial ^2 \log p(\theta)}{\partial \mu \partial \sigma} \\ \frac{\partial ^2 \log p(\theta)}{\partial \sigma \partial \mu} & \frac{\partial ^2 \log p(\theta)}{\partial \sigma^2} \end{bmatrix} = \begin{bmatrix} \sigma^{-2} & 0\\ 0 & 2\sigma^{-2} \end{bmatrix}\]
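As a quick numerical sanity check of both the Hessian-of-KL characterization and this closed form, the short sketch below (illustrative code, not part of the article's repository) finite-differences the closed-form KL divergence between two univariate Normal distributions around θ = (μ, σ) and recovers, up to numerical error, the matrix diag(1/σ², 2/σ²).

import numpy as np

def kl_normal(mu0: float, sigma0: float, mu1: float, sigma1: float) -> float:
    # Closed-form KL( N(mu0, sigma0^2) || N(mu1, sigma1^2) )
    return np.log(sigma1 / sigma0) + (sigma0**2 + (mu0 - mu1)**2) / (2.0 * sigma1**2) - 0.5

def fisher_from_kl(mu: float, sigma: float, eps: float = 1e-3) -> np.ndarray:
    # Finite-difference Hessian of theta -> KL(p_theta0 || p_theta) evaluated at theta = theta0
    f = lambda th: kl_normal(mu, sigma, th[0], th[1])
    theta0 = np.array([mu, sigma])
    hess = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            ei, ej = np.eye(2)[i] * eps, np.eye(2)[j] * eps
            hess[i, j] = (f(theta0 + ei + ej) - f(theta0 + ei - ej)
                          - f(theta0 - ei + ej) + f(theta0 - ei - ej)) / (4.0 * eps**2)
    return hess

print(fisher_from_kl(mu=0.0, sigma=2.0))   # ~ [[0.25, 0.0], [0.0, 0.5]] = diag(1/sigma^2, 2/sigma^2)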
The Fisher metric for the Beta distribution θ = {α, β} is expressed with the trigamma function φ, the second derivative of the log-Gamma function:\[\varphi (z)=\frac{d^2}{dz^2} \log \Gamma (z)\]
\[\mathfrak{I}(\alpha,\beta)=-\textit{E}_{x\sim p}\begin{bmatrix} \frac{\partial ^2 \log p(\theta)}{\partial \alpha^2} & \frac{\partial ^2 \log p(\theta)}{\partial \alpha \partial \beta} \\ \frac{\partial ^2 \log p(\theta)}{\partial \beta \partial \alpha} & \frac{\partial ^2 \log p(\theta)}{\partial \beta^2} \end{bmatrix}\]
\[\mathfrak{I}(\alpha,\beta)=\begin{bmatrix} \varphi(\alpha)-\varphi(\alpha+\beta) & -\varphi(\alpha+\beta)\\ -\varphi(\alpha+\beta) & \varphi(\beta)-\varphi(\alpha+\beta) \end{bmatrix}\]
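
The trigamma function φ is available in SciPy as scipy.special.polygamma(1, .), so the Beta Fisher matrix above can be evaluated directly with a few lines of standalone code (a sketch independent of the Geomstats implementation used below):

import numpy as np
from scipy.special import polygamma

def beta_fisher_matrix(alpha: float, beta: float) -> np.ndarray:
    # Trigamma: second derivative of log Gamma
    trigamma = lambda z: polygamma(1, z)
    c = trigamma(alpha + beta)
    return np.array([[trigamma(alpha) - c, -c],
                     [-c, trigamma(beta) - c]])

print(beta_fisher_matrix(2.0, 3.0))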

Implementation

We leverage classes defined in the previous articles of this series, in particular the HypersphereSpace wrapper.
Let's first define a base class for all distributions to be evaluated on a hypersphere [ref 5].

from typing import List, NoReturn

class GeometricDistribution(object):
    _ZERO_TGT_VECTOR = [0.0, 0.0, 0.0]

    def __init__(self) -> None:
        # HypersphereSpace is the wrapper around the Geomstats hypersphere
        # introduced in a previous article of this series
        self.manifold = HypersphereSpace(True)


    def show_points(self, num_pts: int, tgt_vector: List[float] = _ZERO_TGT_VECTOR) -> NoReturn:
        # Random points generated on the hypersphere
        manifold_pts = self._random_manifold_points(num_pts, tgt_vector)

        # Exponential map used to project the tangent vector onto the hypersphere
        exp_map = self.manifold.tangent_vectors(manifold_pts)
        for v, end_pt in exp_map:
            print(f'Tangent vector: {v} End point: {end_pt}')

        # Display the manifold points (and tangent vectors) on the hypersphere
        self.manifold.show_manifold(manifold_pts)

The purpose of the method show_points is to display the various data points, with an optional tangent vector, on the hypersphere. The argument num_pts specifies the number of random points to be generated on the hypersphere. The tangent vector is displayed only if the argument tgt_vector differs from the zero vector (_ZERO_TGT_VECTOR).
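
The helper _random_manifold_points and the HypersphereSpace wrapper come from earlier installments and are not reproduced here. As a self-contained stand-in, random points on the unit 2-sphere can be drawn directly with Geomstats' Hypersphere class; the snippet below is only a minimal sketch, not the article's actual helper.

from geomstats.geometry.hypersphere import Hypersphere

# Unit 2-sphere embedded in R^3
sphere = Hypersphere(dim=2)

# Uniformly sampled points on the sphere; each row is a 3D unit vector
random_points = sphere.random_uniform(n_samples=4)
print(random_points)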


Normal distribution

The class NormalHypersphere encapsulates the display of the Normal distribution on the hypersphere. The constructor initializes the normal distribution implemented in the Geomstats library.
The method show_distribution displays num_pdfs probability density functions over a set of num_manifold_pts manifold points on the hypersphere. This specific implementation uses only two points. The Fisher-Rao geodesic between the two points is computed with the Geomstats method metric.geodesic.
The geodesic is sampled at 100 points between the two points A and B. Finally, the density functions pdfs are obtained by passing these geodesic points to the Geomstats method NormalDistributions.point_to_pdf.

import geomstats.backend as gs
import matplotlib.pyplot as plt

class NormalHypersphere(GeometricDistribution):

    def __init__(self) -> None:
        from geomstats.information_geometry.normal import NormalDistributions

        super(NormalHypersphere, self).__init__()
        # Manifold of univariate Normal distributions
        self.normal = NormalDistributions(sample_dim=1)


    def show_distribution(self, num_pdfs: int, num_manifold_pts: int) -> NoReturn:
        manifold_pts = self._random_manifold_points(num_manifold_pts)
        A = manifold_pts[0]
        B = manifold_pts[1]

        # Fisher-Rao geodesic between the two manifold points on the hypersphere
        geodesic_ab_fisher = self.normal.metric.geodesic(A.location, B.location)
        t = gs.linspace(0, 1, 100)

        # Generate the density functions associated with the points
        # along the geodesic between A and B
        pdfs = self.normal.point_to_pdf(geodesic_ab_fisher(t))
        x = gs.linspace(0.2, 0.7, num_pdfs)

        for i in range(num_pdfs):
            plt.plot(x, pdfs(x)[i, :]/20.0)   # Normalization factor
        plt.title('Normal distribution on Hypersphere')
        plt.show()

Let's plot 2 randomly sampled data points, with their tangent_vector, on the hypersphere (1), and then visualize 40 normalized Normal probability density functions (2).

normal_dist = NormalHypersphere()
num_points = 2
tangent_vector = [0.4, 0.7, 0.2]

# 1. Display the 2 data points on the hypersphere
normal_dist.show_points(num_points, tangent_vector)

# 2. Visualize the 40 Normal probability density functions
num_pdfs = 40
normal_dist.show_distribution(num_pdfs, num_points)

Fig. 1 Two random data points on a Hypersphere with their tangent vectors 



Fig. 2 Visualization of Normal distribution between two random points on a hypersphere


Beta distribution

Let's wrap the evaluation of the Beta distribution on a hypersphere into the class BetaHypersphere, which inherits from GeometricDistribution. It leverages the BetaDistributions class in Geomstats.

class BetaHypersphere(GeometricDistribution):

    def __init__(self) -> None:
        from geomstats.information_geometry.beta import BetaDistributions

        super(BetaHypersphere, self).__init__()
        self.beta = BetaDistributions()


    def show_distribution(self, num_manifold_pts: int, num_interpolations: int) -> NoReturn:
        # 1. Generate random points on the hypersphere using the von Mises-Fisher algorithm
        manifold_pts = self._random_manifold_points(num_manifold_pts)
        t = gs.linspace(0, 1.1, num_interpolations)[1:]

        # 2. Define the Beta pdf associated with each manifold point
        beta_values_pdfs = [self.beta.point_to_pdf(manifold_pt.location)(t) for manifold_pt in manifold_pts]

        # 3. Generate, normalize and display each Beta distribution
        for beta_values in beta_values_pdfs:
            min_beta = min(beta_values)
            delta_beta = max(beta_values) - min_beta
            y = [(beta_value - min_beta)/delta_beta for beta_value in beta_values]
            plt.plot(t, y)
        plt.title('Beta distribution on Hypersphere')
        plt.show()


The method show_distribution generates random points on the hypersphere (1) and computes the Beta density function at each of these points using the Geomstats method BetaDistributions.point_to_pdf (2).
The values generated by the pdfs are then normalized and plotted (3).


Let's plot 10 randomly sampled data points on the hypersphere (1), and then visualize the 10 normalized Beta probability density functions, each evaluated at 200 interpolation points (2).

beta_dist = BetaHypersphere()

num_interpolations = 200
num_manifold_pts = 10

# 1. Display the 10 data points on the hypersphere
beta_dist.show_points(num_manifold_pts)

# 2. Visualize the Beta probability density functions with interpolation points
beta_dist.show_distribution(num_manifold_pts, num_interpolations)
Fig. 3 10 random data points on a Hypersphere


Fig. 4 Visualization of Beta distributions associated with 10 data points on hypersphere


References



--------------------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3 
and Geometric Learning in Python Newsletter on LinkedIn.

Appendix

Here is the list of published articles related to geometric learning:

Monday, January 22, 2024

Foundation of Geometric Learning

Target audience: Beginner
Estimated reading time: 4'
Newsletter: Geometric Learning in Python
Facing challenges with high-dimensional, densely packed but limited data, and complex distributions? Differential geometry offers a solution by enabling data scientists to grasp the true shape and distribution of data.

Table of contents
      Deep learning

What you will learn: You'll discover how differential geometry tackles the challenges of scarce data, high dimensionality, and the demand for independent representation in creating advanced machine learning models, such as graph or physics-informed neural networks.

Note
This article does not deal with the mathematical formalism of differential geometry or its implementation in Python.


Challenges 

Deep learning

Data scientists face challenges when building deep learning models that can be addressed by differential geometry. Those challenges are:
  • High dimensionality: Models related to computer vision or images deal with high-dimensional data, such as images or videos, which can make training more difficult due to the curse of dimensionality.
  • Availability of quality data: The quality and quantity of training data significantly affect the model's ability to generate realistic samples. Insufficient or biased data can lead to overfitting or poor generalization.
  • Underfitting or overfitting: Balancing the model's ability to generalize well while avoiding overfitting to the training data is a critical challenge. Models that overfit may generate high-quality outputs that are too similar to the training data, lacking novelty.
  • Embedding physics laws or geometric constraints: Incorporating domain constraints, such as boundary conditions or differential equations, into deep learning models is very challenging for high-dimensional data.
  • Representation dependence: The performance of many learning algorithms is very sensitive to the choice of representation (e.g., the impact of z-normalization on predictors).

Generative modeling

Generative modeling includes techniques such as auto-encoders, generative adversarial networks (GANs), Markov chains, transformers, and their various derivatives.

Creating generative models presents several specific challenges beyond plain vanilla deep learning models for data scientists and engineers, primarily due to the complexity of modeling and generating data that accurately reflects real-world distributions. The challenges that can be addressed with differential geometry include:
  • Performance evaluation: Unlike supervised learning models, assessing the performance of generative models is not straightforward. Traditional metrics like accuracy do not apply, leading to the development of alternative metrics such as the Frechet Inception Distance (FID) or Inception Score, which have their limitations.
  • Latent space interpretability: Understanding and interpreting the latent space of generative models, where the model learns a compressed representation of the data, can be challenging but is crucial for controlling and improving the generation process.


What is differential geometry?

Differential geometry is a branch of mathematics that uses techniques from calculus, algebra and topology to study the properties of curves, surfaces, and higher-dimensional objects in space. It focuses on concepts such as curvature, angles, and distances, examining how these properties vary as one moves along different paths on a geometric object [ref 1]. 
Differential geometry is crucial in understanding the shapes and structures of objects that can be continuously altered, and it has applications in many fields including physics (e.g., general relativity and quantum mechanics), engineering, computer science, and data exploration and analysis.

Moreover, it is important to differentiate between differential topology and differential geometry, as both disciplines examine the characteristics of differentiable (or smooth) manifolds but aim for different goals. Differential topology is concerned with the overarching structure or global aspects of a manifold, whereas differential geometry investigates the manifold's local and differential attributes, including aspects like connection and metric [ref 2].

In summary, differential geometry provides data scientists with a mathematical framework that facilitates the creation of accurate and complex models by leveraging geometric and topological insights [ref 3].


Applicability of differential geometry

Why differential geometry?

The following highlights the advantages of utilizing differential geometry to tackle the difficulties encountered by researchers in the creation and validation of generative models.

Understanding data manifolds: Data in high-dimensional spaces often lie on lower-dimensional manifolds. Differential geometry provides tools to understand the shape and structure of these manifolds, enabling generative models to learn more efficient and accurate representations of data.

Improving latent space interpolation: In generative models, navigating the latent space smoothly is crucial for generating realistic samples. Differential geometry offers methods to interpolate more effectively within these spaces, ensuring smoother transitions and better quality of generated samples.

Optimization on manifolds: The optimization processes used in training generative models can be enhanced by applying differential geometric concepts. This includes optimizing parameters directly on the manifold structure of the data or model, potentially leading to faster convergence and better local minima.

Geometric regularization: Incorporating geometric priors or constraints based on differential geometry can help in regularizing the model, guiding the learning process towards more realistic or physically plausible solutions, and avoiding overfitting.

Advanced sampling techniques: Differential geometry provides sophisticated techniques for sampling from complex distributions (important for both training and generating new data points), improving upon traditional methods by considering the underlying geometric properties of the data space.

Enhanced model interpretability: By leveraging the geometric structure of the data and model, differential geometry can offer new insights into how generative models work and how their outputs relate to the input data, potentially improving interpretability.

Physics-Informed Neural Networks: Projecting physical laws and boundary conditions, such as a set of partial differential equations, onto a surface manifold improves the optimization of deep learning models.

Innovative architectures: Insights from differential geometry can lead to the development of novel neural network architectures that are inherently more suited to capturing the complexities of data manifolds, leading to more powerful and efficient generative models. 

In summary, differential geometry equips researchers and practitioners with a deep toolkit for addressing the intrinsic challenges of generative AI, from better understanding and exploring complex data landscapes to developing more sophisticated and effective models [ref 3].

Representation independence

The effectiveness of many learning models greatly depends on how the data is represented, such as the impact of z-normalization on predictors. Representation Learning is the technique in machine learning that identifies and utilizes meaningful patterns from raw data, creating more accessible and manageable representations. Deep neural networks, as models of representation learning, typically transform and encode information into a different subspace. 
In contrast, differential geometry focuses on developing constructs that remain consistent regardless of the data representation method. It gives us a way to construct objects which are intrinsic to the manifold itself [ref 4].

Manifold and latent space

A manifold is essentially a space that, around every point, looks like Euclidean space, created from a collection of maps (or charts) called an atlas, which belongs to Euclidean space. Differential manifolds have a tangent space at each point, consisting of vectors. Riemannian manifolds are a type of differential manifold equipped with a metric to measure curvature, gradient, and divergence. 
In deep learning, the manifolds of interest are typically Riemannian due to these properties.

It is important to keep in mind that the goal of any machine learning or deep learning model is to predict p(y) from the conditional p(y|x), for observed features y given latent features x:\[p(y)=\int_{\Omega } p(y|x)\, p(x)\, dx\]The latent space x can be defined as a differential manifold embedded in the data space (whose dimension is the number of features of the input data).
Given a differentiable function f on a domain Ω, a manifold of dimension d is defined by:
\[\mathit{M}=f(\Omega) \ \ \ with \ f: \Omega \subset \mathbb{R}^{d}\rightarrow \mathbb{R}^{N}, \ \ N \geq d\]
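For instance, the unit sphere in 3-dimensional space is a 2-dimensional manifold (d = 2) embedded in a 3-dimensional data space (N = 3), obtained from the spherical parametrization:
\[f(\theta, \varphi)=\left(\sin\theta \ \cos\varphi,\ \sin\theta \ \sin\varphi,\ \cos\theta\right) \ \ \ with \ f: \Omega \subset \mathbb{R}^{2}\rightarrow \mathbb{R}^{3}\]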
In a Riemannian manifold, the metric can be used to 
  • Estimate kernel density
  • Approximate the encoder function of an auto-encoder
  • Represent the vector space defined by classes/labels in a classifier
A manifold is usually visualized with a tangent space at a given point/coordinates.

Illustration of a manifold and its tangent space


The manifold hypothesis states that real-world high-dimensional data lie on low-dimensional manifolds embedded within the high-dimensional space.

Studying data that reside on manifolds can often be done without the need for Riemannian Geometry, yet opting to perform data analysis on manifolds presents three key advantages [ref 5]:
  • By analyzing data directly on its residing manifold, you can simplify the system by reducing its degrees of freedom. This simplification not only makes calculations easier but also results in findings that are more straightforward to understand and interpret.
  • Understanding the specific manifold to which a dataset belongs enhances your comprehension of how the data evolves over time.
  • Being aware of the manifold on which a dataset exists enhances your ability to predict future data points. This knowledge allows for more effective signal extraction from datasets that are either noisy or contain limited data points.

Graph Neural Networks

Graph Neural Networks (GNNs) are a class of deep learning models designed to perform inference on data represented as graphs. They are particularly effective for tasks where the data is structured in a non-Euclidean manner, capturing the relationships and interactions between nodes in a graph.

Graph Neural Networks operate by conducting message passing across a graph, in which features are transmitted from one node to another through the connecting edges (diffusion process). For instance, the concept of Ricci curvature from differential geometry helps to alleviate congestion in the flow of messages [ref 6].

Physics-Informed Neural Networks

Physics-informed neural networks (PINNs) are versatile models capable of integrating physical principles, governed by partial differential equations, into the learning mechanism. They utilize these physical laws as a form of soft constraint or regularization during training, effectively addressing the challenge of limited data in certain engineering applications [ref 7].

Information geometry

Information geometry is a field that combines ideas from differential geometry and information theory to study the geometric structure of probability distributions and statistical models. It focuses on the way information can be quantified, manipulated, and interpreted geometrically, exploring concepts like distance and curvature within the space of probability distributions.
This approach provides a powerful framework for understanding complex statistical models and the relationships between them, making it applicable in areas such as machine learning, signal processing, and more [ref 8].


Python libraries for differential geometry

There are numerous open-source Python libraries available for differential geometry, with a variety of focuses not exclusively tied to machine learning or generative modeling.

References

[8] Information Geometry: Near Randomness and Near Independence - K. Arwini, C.T.J. Dodson - Springer-Verlag, 2008


