Tuesday, July 16, 2024

Riemannian Metric for SPD Manifolds

Target audience: Intermediate
Estimated reading time: 7'

Choosing between Riemannian and Euclidean metrics for classifying signal or dense data can be challenging. This article offers engineers an easy method to select the most suitable metric using the pyRiemann library.



      Setup
      Implementation
      Scoring


What you will learn: How to determine the suitable metric for a given dataset to optimize the decision boundary and accuracy for any classifier.



Notes

  • Environments: Python 3.11, pyRiemann 0.6, mne 1.7.1, scikit-learn 1.5.1, Matplotlib 3.9.1
  • This article assumes that the reader is somewhat familiar with Riemannian geometry [ref 1, 2]. 
  • Source code is available at  Github.com/patnicolas/Data_Exploration/classifiers
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Introduction

Geometric learning tackles the challenges posed by scarce data, high-dimensional spaces, and the requirement for independent representations in the creation of advanced machine learning models. The fundamental aim of studying Riemannian geometry is to comprehend and scrutinize the characteristics of curved spaces, which are not sufficiently explained by Euclidean geometry alone. 

This article is the 11th installment in our ongoing series on geometric learning. Most articles in this series use the Geomstats differential geometry library to implement some of the most common machine learning algorithms on the hypersphere. In this post, we use pyRiemann [ref 3], a Python library dedicated to signal processing and time series analysis on Riemannian manifolds.

The two most common Riemannian metrics for SPD matrices are:
  • Affine-Invariant metric
  • Log-Euclidean metric
These two metrics have been described and evaluated in a previous post: [ref 4]
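As a reminder of what the two metrics compute, here is a minimal NumPy sketch of both distances. The function names are ours for illustration and are not the pyRiemann API. The last lines check the property that gives the affine-invariant metric its name: the distance is unchanged when both matrices are transformed by the same invertible matrix W.

```python
import numpy as np

def _spd_apply(mat, func):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * func(eigvals)) @ eigvecs.T

def log_euclidean_distance(a, b):
    """Frobenius norm of log(A) - log(B)."""
    return np.linalg.norm(_spd_apply(a, np.log) - _spd_apply(b, np.log), 'fro')

def affine_invariant_distance(a, b):
    """Frobenius norm of log(A^{-1/2} B A^{-1/2})."""
    a_inv_sqrt = _spd_apply(a, lambda w: 1.0 / np.sqrt(w))
    return np.linalg.norm(_spd_apply(a_inv_sqrt @ b @ a_inv_sqrt, np.log), 'fro')

# Illustrative 2x2 SPD matrices and an arbitrary invertible matrix W
a = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([[1.0, 0.2], [0.2, 3.0]])
w = np.array([[1.0, 2.0], [0.0, 1.0]])

d_le = log_euclidean_distance(a, b)
d_ai = affine_invariant_distance(a, b)
d_ai_transformed = affine_invariant_distance(w @ a @ w.T, w @ b @ w.T)
print(d_le, d_ai, d_ai_transformed)
```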

Symmetric positive definite (SPD) manifold

SPD matrices were introduced in a previous article, Logistic Regression on Riemann Manifolds.

A square matrix A is symmetric if it is identical to its transpose: if a_ij are the entries of A, then a_ij = a_ji. This implies that A can be fully described by its upper triangular elements.
A square matrix A is positive definite if, for every non-zero vector b, the product bᵀAb > 0 [ref 5].

If a matrix A is both symmetric and positive definite, it is referred to as a symmetric positive definite (SPD) matrix. This type of matrix is extremely useful and appears in various real-world applications. A prominent example in statistics is the covariance matrix, where each entry represents the covariance between two variables (with diagonal entries indicating the variances of individual variables). Covariance matrices are always positive semi-definite (meaning bᵀAb ≥ 0), and they are positive definite if the covariance matrix has full rank, which occurs when its rows are linearly independent.
The collection of all SPD matrices of size n×n forms a manifold.
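The definitions above can be checked numerically. The sketch below is a minimal illustration using NumPy only; the helper name is_spd and the tolerance are our own choices. It shows a full-rank covariance matrix passing the test and a rank-deficient one (a duplicated variable) failing it.

```python
import numpy as np

def is_spd(mat, tol=1e-10):
    """True if mat is symmetric and all its eigenvalues exceed tol."""
    if not np.allclose(mat, mat.T, atol=tol):
        return False
    return bool(np.all(np.linalg.eigvalsh(mat) > tol))

rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 3))          # 200 observations of 3 variables
cov = np.cov(samples, rowvar=False)          # full rank: positive definite

# Third variable duplicates the first one: the covariance loses rank
degenerate = np.column_stack([samples[:, 0], samples[:, 1], samples[:, 0]])
cov_degenerate = np.cov(degenerate, rowvar=False)

print(is_spd(cov), is_spd(cov_degenerate))
```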

pyRiemann library

pyRiemann is a Python machine learning package built on the scikit-learn API. It offers a high-level interface for classifying real or complex-valued multivariate data using the Riemannian geometry of symmetric positive definite (SPD) and Hermitian positive definite (HPD) matrices.

Its primary aim is to conduct multivariate data analysis on time series within these Riemannian manifolds. Our use case consists of comparing well-known machine learning algorithms in Euclidean space and on the Riemannian manifold of SPD matrices. The datasets consist of signals such as electroencephalograms (EEG) or magnetic resonance images (MRI) used in brain-computer interfaces (BCI), and loading them requires the mne library.

Setup

The installation and configuration of pyRiemann is straightforward [ref 6].

To install the pyRiemann module: pip install pyriemann
To install mne: pip install mne
Source from Github: git clone https://github.com/pyRiemann/pyRiemann.git


Datasets

We are comparing two widely used machine learning algorithms, Support Vector Machine (SVM) and k-Nearest Neighbors (k-NN), for Symmetric Positive Definite matrices in Euclidean space (using Scikit-learn) and on a Riemannian manifold. The comparison will be conducted using two datasets:
  • Set 1: Data from Electroencephalograms (Brain-Computer Interface) [ref 7]
  • Set 2: Synthetic Gaussian distributed data

Let's encapsulate the data generation process within a class named SPDMatricesDataset. We consider a sample of 48 SPD matrices per class, with target values defined as binary {0, 1}. The create method generates the two datasets described in the previous section.

Note: Some methods and variables that are not essential for understanding the algorithms have been omitted.

import numpy as np
from typing import List, Tuple


class SPDMatricesDataset(object):

    def __init__(self) -> None:
        n_spd_matrices = 48
        # Binary labels: n_spd_matrices zeros followed by n_spd_matrices ones
        self.target = np.concatenate([
            np.zeros(n_spd_matrices),
            np.ones(n_spd_matrices)
        ])

    # Generation of data sets used in comparing SVM and kNN
    # over Euclidean space and Riemannian manifold
    def create(self) -> List[Tuple[np.array, np.array]]:
        evals_lows = 11
        class_sep_ratio = 1.0
        spd_matrices = self.__make_spd_matrices(evals_lows)
        return [
            (spd_matrices, self.target),                    # Set 1
            self.__make_gaussian_blobs(class_sep_ratio),    # Set 2
        ]
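The private generation methods are omitted from this post. To make the idea concrete, here is a hypothetical NumPy sketch of one way to draw two classes of random SPD matrices with controlled eigenvalues. The helper make_random_spd, the 2x2 dimension and the eigenvalue ranges are assumptions for illustration, not the implementation used in this article.

```python
import numpy as np

def make_random_spd(n_matrices, dim, eval_low, eval_high, rng):
    """Random SPD matrices Q diag(evals) Q^T with eigenvalues in [eval_low, eval_high]."""
    mats = np.empty((n_matrices, dim, dim))
    for k in range(n_matrices):
        # Random orthogonal basis from the QR decomposition of a Gaussian matrix
        q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
        evals = rng.uniform(eval_low, eval_high, size=dim)
        mats[k] = (q * evals) @ q.T
    return mats

rng = np.random.default_rng(42)
n_spd_matrices = 48
class_0 = make_random_spd(n_spd_matrices, 2, 11.0, 13.0, rng)   # assumed range
class_1 = make_random_spd(n_spd_matrices, 2, 13.0, 15.0, rng)   # assumed range

features = np.concatenate([class_0, class_1])
target = np.concatenate([np.zeros(n_spd_matrices), np.ones(n_spd_matrices)])
print(features.shape, target.shape)
```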

The two data sets are visualized with scatter plots using matplotlib module.
Fig. 1 Visualization of SPD matrices data set - Set 1

Fig. 2 Visualization of gaussian data set - Set 2


Finally, the method train_test_data_split extracts the training and test data from the features and target data, and encapsulates them into the SPDTrainingData data class (see Appendix).

@staticmethod
def train_test_data_split(features: np.array,
                          target: np.array) -> SPDTrainingData:
    from sklearn.model_selection import train_test_split

    train_X, test_X, train_y, test_y = train_test_split(
        features,
        target,
        test_size=0.3,
        random_state=42
    )

    return SPDTrainingData(train_X, test_X, train_y, test_y)
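The SPDTrainingData data class is listed in the Appendix, which is not reproduced here. A minimal sketch consistent with the constructor call above could look like the following. The field names are inferred from how the class is used later in this post, and the manual slicing only stands in for train_test_split.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SPDTrainingData:
    # Field order matches the call SPDTrainingData(train_X, test_X, train_y, test_y)
    train_X: np.ndarray
    test_X: np.ndarray
    train_y: np.ndarray
    test_y: np.ndarray

# Toy data: ten 2x2 diagonal SPD matrices with alternating labels
features = np.stack([np.eye(2) * (i + 1) for i in range(10)])
target = np.array([0, 1] * 5)
data = SPDTrainingData(features[:7], features[7:], target[:7], target[7:])
print(data.train_X.shape, data.test_y.shape)
```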


Evaluation

The evaluation of the two metrics (Euclidean and Riemannian) for any given classifier involves two steps:
  • Calculate the classifier's score for both metrics.
  • Define the decision boundary for each of the two datasets.

Implementation

We encapsulate the evaluation of these metrics within a class named SPDMatricesClassifier. Training and scoring the classifier on the two datasets use the pyRiemann API [ref 8], which conveniently follows the method signatures of Scikit-learn's equivalent functions.

from functools import partial
import numpy as np


class SPDMatricesClassifier(object):
    def __init__(self,
                 classifier,
                 spd_metric: SPDMetric,
                 spd_training_data: SPDTrainingData) -> None:
        self.classifier = classifier                 # Target classifier (SVM, ...)
        self.spd_metric = spd_metric                 # Metric (Euclidean, ...)
        self.spd_training_data = spd_training_data   # Our training data

    # Train then score the given classifier using pyRiemann API
    def score(self) -> float:
        # 1. Select metric
        self.classifier.set_params(**{'metric': str(self.spd_metric.value)})

        # 2. Train model
        self.classifier.fit(self.spd_training_data.train_X,
                            self.spd_training_data.train_y)

        # 3. Score model on the test data
        return self.classifier.score(self.spd_training_data.test_X,
                                     self.spd_training_data.test_y)

    # Probability of class 1 for a matrix built from three scalars.
    # np.vectorize applies it element-wise; clf is passed through unchanged.
    @staticmethod
    @partial(np.vectorize, excluded=['clf'])
    def get_probability(cov_x: np.array,
                        cov_y: np.array,
                        cov_z: np.array,
                        clf):
        cov = np.array(
            [[cov_x, cov_y, 0.0, 0.0],
             [cov_y, cov_z, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0]]
        )
        u = cov[np.newaxis, ...]

        return clf.predict_proba(u)[0, 1]
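The SPDMetric type used in the constructor is not shown in this post. Since score reads spd_metric.value as a string, a plausible sketch is an enumeration whose values are metric names. The member names below are hypothetical; the string values 'euclid', 'logeuclid' and 'riemann' are assumed to match the metric names accepted by pyRiemann classifiers.

```python
from enum import Enum

class SPDMetric(Enum):
    # Values are assumed to be the metric names understood by pyRiemann
    euclid = 'euclid'
    log_euclid = 'logeuclid'
    riemann = 'riemann'

# Same dictionary that score() forwards to classifier.set_params
params = {'metric': str(SPDMetric.riemann.value)}
print(params)
```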

Scoring

The score is computed on the test data and target from a sample of 48 SPD matrices per class.

Classifier                   Metric       Set 1    Set 2
K-Nearest Neighbors (k=4)    Euclidean    0.827    0.655
K-Nearest Neighbors (k=4)    Riemann      0.896    0.689
Support Vector Machine       Euclidean    0.965    0.697
Support Vector Machine       Riemann      1.000    0.586

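As expected, both k-Nearest Neighbors and Support Vector Machines achieve higher scores with the Riemannian metric on the SPD matrices (Set 1). On the synthetic Gaussian data (Set 2), the Riemannian metric slightly improves k-Nearest Neighbors but lowers the Support Vector Machine score.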
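The gain of the Riemannian metric over the Euclidean one can be tabulated directly from the scores reported above. This short sketch only restates those numbers; the dictionary layout is our own.

```python
# Scores copied from the results above
scores = {
    ('k-NN', 'Set 1'): {'euclid': 0.827, 'riemann': 0.896},
    ('k-NN', 'Set 2'): {'euclid': 0.655, 'riemann': 0.689},
    ('SVM', 'Set 1'): {'euclid': 0.965, 'riemann': 1.000},
    ('SVM', 'Set 2'): {'euclid': 0.697, 'riemann': 0.586},
}

# Positive gain means the Riemannian metric scored higher
gains = {key: round(s['riemann'] - s['euclid'], 3) for key, s in scores.items()}

for (classifier, dataset), gain in gains.items():
    print(f'{classifier:5s} {dataset}: {gain:+.3f}')
```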


Classifier decision boundary 

The objective is to evaluate the decision boundary between target values {0, 1} for classifying the two datasets using a support vector machine with both Riemannian and Euclidean metrics. The input parameters consist of 48 SPD matrices, a range [11, 15] for displaying the decision boundary, and a class separation of 1.0.
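To illustrate how a decision boundary can be sampled, the sketch below evaluates a class-1 probability over a grid of matrix entries with a vectorized helper. It is a simplified illustration under stated assumptions: the matrices are 2x2, and TraceThresholdClassifier is a hypothetical stand-in for a fitted classifier exposing predict_proba, not a pyRiemann class.

```python
from functools import partial
import numpy as np

class TraceThresholdClassifier:
    """Hypothetical stand-in for a fitted classifier exposing predict_proba."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict_proba(self, covs):
        # Logistic score on the trace; columns are P(class 0) and P(class 1)
        traces = np.trace(covs, axis1=1, axis2=2)
        p1 = 1.0 / (1.0 + np.exp(-(traces - self.threshold)))
        return np.column_stack([1.0 - p1, p1])

@partial(np.vectorize, excluded=['clf'])
def get_probability(cov_x, cov_y, cov_z, clf):
    # Build one symmetric 2x2 matrix per grid cell and query the classifier
    cov = np.array([[cov_x, cov_y], [cov_y, cov_z]])
    return clf.predict_proba(cov[np.newaxis, ...])[0, 1]

grid = np.linspace(11.0, 15.0, 5)        # display range [11, 15]
xx, zz = np.meshgrid(grid, grid)         # the two diagonal entries
yy = np.zeros_like(xx)                   # off-diagonal entry held fixed
clf = TraceThresholdClassifier(threshold=26.0)
proba = get_probability(xx, yy, zz, clf=clf)   # clf must be passed by keyword
print(proba.shape)
```

The resulting array can be passed to a contour plot such as Matplotlib's contourf, with the 0.5 level marking the boundary.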


Dataset 1 (SPD matrices)
Fig. 3 Decision Boundary Support Vector Machine - Euclidean space


Fig. 4 Decision Boundary Support Vector Machine - Riemann Manifold


Dataset 2 (Synthetic Gaussian)
Fig. 5 Decision Boundary Support Vector Machine - Euclidean space

Fig. 6 Decision Boundary Support Vector Machine - Riemann Manifold




Saturday, June 29, 2024

Fisher Information Matrix

Target audience: Advanced
Estimated reading time: 7'
The Fisher Information Matrix plays a crucial role in various aspects of machine learning and statistics. Its primary significance lies in providing a measure of the amount of information that an observable random variable carries about an unknown parameter upon which the probability depends.




       Key elements
       Use cases


What you will learn: How to estimate and visualize the Fisher information matrix for Normal and Beta distributions on a hypersphere.

Notes

  • Environments: Python 3.10.10, Geomstats 2.7.0
  • This article assumes that the reader is somewhat familiar with differential and tensor calculus [ref 1]. Please refer to our previous articles related to geometric learning listed in the Appendix.
  • Source code is available at  Github.com/patnicolas/Data_Exploration/Information Geometry
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Introduction

This article is the 10th installment of our ongoing series focused on geometric learning. It introduces some basic elements of information geometry as an extension of differential geometry. As with previous articles, we utilize the Geomstats Python library [ref. 2] to implement concepts associated with geometric learning. 

Note: Summaries of my earlier articles on this topic can be found in the Appendix.

As a reminder, the primary goal of learning Riemannian geometry is to understand and analyze the properties of curved spaces that cannot be described adequately using Euclidean geometry alone. 

Here is a synopsis of this article:
  1. Brief introduction to information geometry
  2. Overview and mathematical formulation of the Fisher information matrix
  3. Computation of the Fisher metric for Normal and Beta distributions
  4. Implementation in Python using the Geomstats library

Information geometry

Information geometry applies the principles and methods of differential geometry to problems in probability theory and statistics [ref 3]. It studies the manifold of probability distributions and provides a natural framework for understanding and analyzing statistical models.

Key elements

  • Statistical manifolds: Families of probability distributions are considered as a manifold, with each distribution representing a point on this manifold.
  • Riemannian metrics: The Fisher information metric is commonly used to define a Riemannian metric on the statistical manifold. This metric measures the amount of information that an observable random variable carries about an unknown parameter.
  • Divergence measures: Measures such as the Kullback-Leibler (KL) divergence quantify the difference between two probability distributions.
  • Connections and curvature: Differential geometry concepts such as affine connections and curvature are used to describe the geometric properties of statistical models (e.g., the α-connection family).
  • Dualistic inference: Exponential and mixture connections provide a rich structure for statistical inference.

Use cases

Here is a non-exhaustive list of applications of information geometry:
  • Statistical Inference: Parameter estimation, hypothesis testing, and model selection (e.g., Bayesian posterior distributions and the development of efficient sampling algorithms such as Hamiltonian Monte Carlo)
  • Optimization: The natural gradient descent method rescales the gradient by the inverse of the Fisher information matrix, which often leads to faster convergence than traditional gradient descent.
  • Finance: Modeling uncertainties and analyzing statistical properties of financial models.
  • Machine Learning: Optimization of learning algorithms (e.g., understanding the EM algorithm used in statistical estimation for latent variable models)
  • Neuroscience: Neural coding and information processing in the brain by modeling neural responses as probability distributions.
  • Robotics: Development of probabilistic robotics, where uncertainty and sensor noise are modeled using probability distributions.
  • Information Theory: Concepts for encoding, compression, and transmission of information.
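To make the optimization use case concrete, here is a minimal sketch of natural gradient descent fitting a univariate normal distribution. It assumes the Fisher matrix diag(1/σ², 2/σ²) in the coordinates (μ, σ); the function name, learning rate and starting point are illustrative choices.

```python
import numpy as np

def natural_gradient_fit(x, lr=0.5, n_steps=100):
    """Fit (mu, sigma) of a normal distribution by natural gradient descent."""
    mu, sigma = 0.0, 1.0
    for _ in range(n_steps):
        # Gradient of the average negative log-likelihood w.r.t. (mu, sigma)
        grad = np.array([
            -(x.mean() - mu) / sigma**2,
            1.0 / sigma - np.mean((x - mu)**2) / sigma**3,
        ])
        # Fisher information matrix of the normal distribution at (mu, sigma)
        fisher = np.diag([1.0 / sigma**2, 2.0 / sigma**2])
        # Natural gradient step: precondition the gradient by the inverse Fisher matrix
        step = np.linalg.solve(fisher, grad)
        mu, sigma = mu - lr * step[0], sigma - lr * step[1]
    return mu, sigma

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=5000)
mu_hat, sigma_hat = natural_gradient_fit(x)
print(mu_hat, sigma_hat)
```

The estimates converge to the sample mean and the sample standard deviation, which are the maximum likelihood estimates.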

Fisher information matrix

The Fisher information matrix is a type of Riemannian metric that can be applied to a smooth statistical manifold [ref 4]. It serves to quantify the informational difference between measurements. The points on this manifold represent probability measures defined within a Euclidean probability space, such as the Normal distribution. Mathematically, it is represented by the Hessian of the Kullback-Leibler divergence.
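The link with the Kullback-Leibler divergence can be verified numerically for the univariate normal distribution. The sketch below uses the closed-form KL divergence between two normals and a central finite-difference Hessian; the function names and step size are illustrative. The Hessian at θ' = θ is compared with the known Fisher matrix diag(1/σ², 2/σ²).

```python
import numpy as np

def kl_normal(mu0, sigma0, mu1, sigma1):
    """KL( N(mu0, sigma0^2) || N(mu1, sigma1^2) ) in closed form."""
    return (np.log(sigma1 / sigma0)
            + (sigma0**2 + (mu0 - mu1)**2) / (2.0 * sigma1**2) - 0.5)

def kl_hessian(mu, sigma, h=1e-4):
    """Finite-difference Hessian of theta' -> KL(theta || theta') at theta' = theta."""
    theta = np.array([mu, sigma])
    basis = np.eye(2) * h

    def f(t):
        return kl_normal(mu, sigma, t[0], t[1])

    hess = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            hess[i, j] = (f(theta + basis[i] + basis[j])
                          - f(theta + basis[i] - basis[j])
                          - f(theta - basis[i] + basis[j])
                          + f(theta - basis[i] - basis[j])) / (4.0 * h**2)
    return hess

hessian = kl_hessian(mu=1.0, sigma=2.0)
expected = np.diag([1.0 / 2.0**2, 2.0 / 2.0**2])   # Fisher matrix at sigma = 2
print(np.round(hessian, 4))
```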

Let's consider a statistical manifold with coordinates (or parameters) θ and its probability density functions over an interval X as follows:
\[P= \left \{ p(x, \theta); \ x \in X, \ \int_{R} p(x, \theta) dx = 1\right \}\]
The Fisher metric is a Riemannian metric tensor defined as the expectation of the second partial derivatives of the negative log-likelihood with respect to two coordinates θ_i and θ_j.
\[g_{ij}(\theta) = -E\left [ \frac{\partial^2 \log p(x,\theta) }{\partial \theta_{i}\partial\theta_{j}} \right ] = - \int_{R}{\frac{\partial^2 \log p(x,\theta) }{\partial \theta_{i}\partial\theta_{j}}}p(x, \theta)dx\]

The Fisher information or Fisher-Rao metric quantifies the amount of information in the data regarding a parameter θ. The Fisher-Rao metric, an intrinsic measure, enables the analysis of a finite, n-dimensional statistical manifold M through its line element:
\[ds^2=\sum_{i=1}^{n}{\sum_{j=1}^{n}}g_{ij}\,d\theta^{i}d\theta^{j}\]
The Fisher metric for the normal distribution θ = {μ, σ} is computed as:
\[\mathfrak{I}(\mu, \sigma)=-E_{x\sim p}\begin{bmatrix} \frac{\partial ^2 \log p(x,\theta)}{\partial \mu^2} & \frac{\partial ^2 \log p(x,\theta)}{\partial \mu \partial \sigma} \\ \frac{\partial ^2 \log p(x,\theta)}{\partial \sigma \partial \mu} & \frac{\partial ^2 \log p(x,\theta)}{\partial \sigma^2} \end{bmatrix} = \begin{bmatrix} \sigma^{-2} & 0\\ 0 & 2\sigma^{-2} \end{bmatrix}\]
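As a worked check of this result, the entries follow from the log-density of the univariate normal distribution, using E[x − μ] = 0 and E[(x − μ)²] = σ²:
\[\log p(x,\theta) = -\log\sigma - \frac{1}{2}\log(2\pi) - \frac{(x-\mu)^2}{2\sigma^2}\]
\[\frac{\partial^2 \log p}{\partial\mu^2} = -\frac{1}{\sigma^2}, \quad \frac{\partial^2 \log p}{\partial\mu\,\partial\sigma} = -\frac{2(x-\mu)}{\sigma^3}, \quad \frac{\partial^2 \log p}{\partial\sigma^2} = \frac{1}{\sigma^2} - \frac{3(x-\mu)^2}{\sigma^4}\]
\[-E\left[\frac{\partial^2 \log p}{\partial\mu^2}\right] = \sigma^{-2}, \quad -E\left[\frac{\partial^2 \log p}{\partial\mu\,\partial\sigma}\right] = 0, \quad -E\left[\frac{\partial^2 \log p}{\partial\sigma^2}\right] = -\sigma^{-2} + 3\sigma^{-2} = 2\sigma^{-2}\]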
The Fisher metric for the beta distribution θ = {α, β} is computed using the trigamma function φ:
\[\varphi (z)=\frac{d^2}{dz^2} \log \Gamma (z)\]
\[\mathfrak{I}(\alpha,\beta)=-E_{x\sim p}\begin{bmatrix} \frac{\partial ^2 \log p(x,\theta)}{\partial \alpha^2} & \frac{\partial ^2 \log p(x,\theta)}{\partial \alpha \partial \beta} \\ \frac{\partial ^2 \log p(x,\theta)}{\partial \beta \partial \alpha} & \frac{\partial ^2 \log p(x,\theta)}{\partial \beta^2} \end{bmatrix}\]
\[\mathfrak{I}(\alpha,\beta)=\begin{bmatrix} \varphi(\alpha)-\varphi(\alpha+\beta) & -\varphi(\alpha+\beta)\\ -\varphi(\alpha+\beta) & \varphi(\beta)-\varphi(\alpha+\beta) \end{bmatrix}\]
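The closed form for the beta distribution can be evaluated with a few lines of standard-library Python. The sketch below implements the trigamma function φ with the recurrence φ(z) = φ(z + 1) + 1/z² and an asymptotic series; the function names are ours, and scipy.special.polygamma(1, z) can be used instead when SciPy is available.

```python
import math

def trigamma(z):
    """Second derivative of log Gamma(z) for z > 0."""
    total = 0.0
    while z < 6.0:
        # Shift the argument upward: trigamma(z) = trigamma(z + 1) + 1 / z^2
        total += 1.0 / (z * z)
        z += 1.0
    inv = 1.0 / z
    inv2 = inv * inv
    # Asymptotic expansion: 1/z + 1/(2z^2) + 1/(6z^3) - 1/(30z^5) + 1/(42z^7) - 1/(30z^9)
    series = inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 * (1.0 / 42.0 - inv2 / 30.0)))
    return total + inv + 0.5 * inv2 + series

def fisher_beta(alpha, beta):
    """Fisher information matrix of the beta distribution in (alpha, beta)."""
    shared = trigamma(alpha + beta)
    return [[trigamma(alpha) - shared, -shared],
            [-shared, trigamma(beta) - shared]]

fisher = fisher_beta(2.0, 3.0)
determinant = fisher[0][0] * fisher[1][1] - fisher[0][1] * fisher[1][0]
print(fisher, determinant)
```

A positive determinant together with positive diagonal entries confirms that the matrix is positive definite at this point, as required for a Riemannian metric.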

Implementation

We leverage classes defined in previous articles, such as HypersphereSpace.
Let's first define a base class for all distributions to be defined on a hypersphere [ref 5].

class GeometricDistribution(object):
    _ZERO_TGT_VEC = [0.0, 0.0, 0.0]

    def __init__(self) -> None:
        self.manifold = HypersphereSpace(True)

    def show_points(self,
                    num_pts: int,
                    tgt_vector: List[float] = _ZERO_TGT_VEC) -> None:
        # Random points generated on the hypersphere
        manifold_pts = self._random_manifold_points(num_pts, tgt_vector)

        # Exponential map used to project the tangent vector on the hypersphere
        exp_map = self.manifold.tangent_vectors(manifold_pts)

        for v, end_pt in exp_map:
            print(f'Tangent vector: {v} End point: {end_pt}')
        self.manifold.show_manifold(manifold_pts)

The purpose of the method show_points is to display the data points, with an optional tangent vector, on the hypersphere. The argument num_pts specifies the number of random points to be generated on the hypersphere. The tangent vector is displayed if the argument tgt_vector is not defined as the origin (_ZERO_TGT_VEC).
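For readers who want to see the geometry without the helper classes, here is a minimal NumPy sketch of the exponential map on the unit sphere. The functions to_tangent and exp_map are illustrative and are not the HypersphereSpace or Geomstats API. The random point is drawn by normalizing a Gaussian sample, which is an assumption about the sampling scheme.

```python
import numpy as np

def to_tangent(point, vector):
    """Project a vector onto the tangent space of the unit sphere at point."""
    return vector - np.dot(vector, point) * point

def exp_map(point, tangent_vec):
    """Follow the great circle leaving point in the direction tangent_vec."""
    norm = np.linalg.norm(tangent_vec)
    if norm < 1e-12:
        return point.copy()
    return np.cos(norm) * point + np.sin(norm) * tangent_vec / norm

rng = np.random.default_rng(1)
point = rng.normal(size=3)
point /= np.linalg.norm(point)              # random point on the sphere S^2

tangent = to_tangent(point, np.array([0.4, 0.7, 0.2]))
end_point = exp_map(point, tangent)
print(f'Tangent vector: {tangent} End point: {end_point}')
```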


Normal distribution

The class NormalHypersphere encapsulates the display of the normal distribution on the hypersphere. The constructor initializes the normal distribution implemented in the Geomstats library.
The method show_distribution displays num_pdfs probability density functions over a set of num_manifold_pts manifold points on the hypersphere. This specific implementation uses only two points. The geodesic under the Fisher-Rao metric is computed using the metric.geodesic Geomstats method.
The geodesic is evaluated at 100 points between the two points A and B. Finally, the density functions, pdfs, are obtained by passing these points to the NormalDistributions.point_to_pdf Geomstats method.

import geomstats.backend as gs
import matplotlib.pyplot as plt
from geomstats.information_geometry.normal import NormalDistributions


class NormalHypersphere(GeometricDistribution):

    def __init__(self) -> None:
        super(NormalHypersphere, self).__init__()
        self.normal = NormalDistributions(sample_dim=1)

    def show_distribution(self,
                          num_pdfs: int,
                          num_manifold_pts: int) -> None:
        manifold_pts = self._random_manifold_points(num_manifold_pts)
        A = manifold_pts[0]
        B = manifold_pts[1]

        # Apply the Fisher metric for the two manifold points
        # on a Hypersphere
        geodesic_ab_fisher = self.normal.metric.geodesic(A.location,
                                                         B.location)
        t = gs.linspace(0, 1, 100)

        # Generate the various density functions associated to
        # the Fisher metric between the two points on the hypersphere
        pdfs = self.normal.point_to_pdf(geodesic_ab_fisher(t))
        x = gs.linspace(0.2, 0.7, num_pdfs)

        for i in range(num_pdfs):
            plt.plot(x, pdfs(x)[i, :]/20.0)   # Normalization factor
        plt.title('Normal distribution on Hypersphere')
        plt.show()
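For univariate normal distributions, the length of the Fisher-Rao geodesic has a closed form, because the metric diag(1/σ², 2/σ²) is a rescaled hyperbolic half-plane metric. The sketch below is a standalone illustration using only the standard library; the function name is ours, and it is not the Geomstats implementation.

```python
import math

def fisher_rao_distance_normal(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    # Half-plane distance after the change of variable mu -> mu / sqrt(2)
    delta = ((mu1 - mu2)**2 / 2.0 + (sigma1 - sigma2)**2) / (2.0 * sigma1 * sigma2)
    return math.sqrt(2.0) * math.acosh(1.0 + delta)

# With equal means, the distance reduces to sqrt(2) * |log(sigma2 / sigma1)|
d_same_mean = fisher_rao_distance_normal(0.0, 1.0, 0.0, math.e)

# The same shift of the mean costs less when the distributions are wider
d_narrow = fisher_rao_distance_normal(0.0, 0.5, 1.0, 0.5)
d_wide = fisher_rao_distance_normal(0.0, 2.0, 1.0, 2.0)
print(d_same_mean, d_narrow, d_wide)
```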

Let's plot 2 randomly sampled data points associated with a tangent_vector on Hypersphere (1) then visualize 40 normalized normal probability density distributions (2).

normal_dist = NormalHypersphere()
num_points = 2
tangent_vector = [0.4, 0.7, 0.2]

# 1. Display the 2 data points on the hypersphere
normal_dist.show_points(num_points, tangent_vector)

# 2. Visualize the 40 normal probability density functions
num_pdfs = 40
normal_dist.show_distribution(num_pdfs, num_points)

Fig. 1 Two random data points on a Hypersphere with their tangent vectors 



Fig. 2 Visualization of Normal distribution between two random points on a hypersphere


Beta distribution

Let's wrap the evaluation of the Beta distribution on a hypersphere into the class BetaHypersphere that inherits from GeometricDistribution. It leverages the BetaDistributions class in Geomstats. 

import geomstats.backend as gs
import matplotlib.pyplot as plt
from geomstats.information_geometry.beta import BetaDistributions


class BetaHypersphere(GeometricDistribution):

    def __init__(self) -> None:
        super(BetaHypersphere, self).__init__()
        self.beta = BetaDistributions()

    def show_distribution(self,
                          num_manifold_pts: int,
                          num_interpolations: int) -> None:

        # 1. Generate random points on Hypersphere -Von Mises algorithm
        manifold_pts = self._random_manifold_points(num_manifold_pts)
        t = gs.linspace(0, 1.1, num_interpolations)[1:]

        # 2. Define the beta pdfs associated with each
        beta_values_pdfs = [self.beta.point_to_pdf(manifold_pt.location)(t)
                            for manifold_pt in manifold_pts]

        # 3. Generate, normalize and display each Beta distribution
        for beta_values in beta_values_pdfs:
            min_beta = min(beta_values)
            delta_beta = max(beta_values) - min_beta
            y = [(beta_value - min_beta)/delta_beta
                 for beta_value in beta_values]
            plt.plot(t, y)
        plt.title('Beta distribution on Hypersphere')
        plt.show()


The method show_distribution generates random points on the Hypersphere (1) and computes the beta density function at these points using the Geomstats BetaDistributions.point_to_pdf (2).
The values generated by the pdfs are normalized, then plotted (3).
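The normalization in step 3 is a min-max rescaling of each density curve to [0, 1]. The following standalone sketch reproduces it for a single beta density without Geomstats. The helper names and the parameters α = 2, β = 5 are illustrative, and the evaluation grid is restricted to the open interval (0, 1), which is the support of the beta distribution.

```python
import math
import numpy as np

def beta_pdf(t, alpha, beta):
    """Beta density on (0, 1), computed in log space for numerical stability."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return np.exp(log_norm + (alpha - 1.0) * np.log(t) + (beta - 1.0) * np.log1p(-t))

def min_max_normalize(values):
    """Rescale values linearly so that they span [0, 1]."""
    lowest, highest = values.min(), values.max()
    return (values - lowest) / (highest - lowest)

t = np.linspace(0.0, 1.0, 200)[1:-1]        # drop the end points 0 and 1
y = min_max_normalize(beta_pdf(t, alpha=2.0, beta=5.0))
mode = t[y.argmax()]                        # expected near (alpha-1)/(alpha+beta-2)
print(mode)
```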


Let's plot 10 randomly sampled data points on the Hypersphere (1), then visualize the 10 associated normalized beta probability density functions, each evaluated at 200 interpolation points (2).

beta_dist = BetaHypersphere()

num_interpolations = 200
num_manifold_pts = 10

# 1. Display the 10 data points on the hypersphere
beta_dist.show_points(num_manifold_pts)

# 2. Visualize the probability density functions with
#    interpolation points
beta_dist.show_distribution(num_manifold_pts, num_interpolations)
Fig. 3 10 random data points on a Hypersphere


Fig. 4 Visualization of Beta distributions associated with 10 data points on hypersphere