Sunday, December 15, 2024

Approximating PCA on Manifolds

  Target audience: Advanced
Estimated reading time: 5'

 

Have you ever wondered how to perform Principal Component Analysis on manifolds? An approximate solution relies on the fact that tangent spaces are locally Euclidean.

Table of contents

       Principal components analysis (PCA)
       Components
       Computation
Follow me on LinkedIn

What you will learn: How to approximate PCA on a manifold using Eigen decomposition (orthonormal basis) on the tangent space.

Notes

  • Environments: Python 3.12.5,  GeomStats 2.8.0, Matplotlib 3.9, Numpy 2.2.0, Scikit-learn 1.6.0 
  • Source code is available on GitHub [ref 1]
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statement.

Introduction

This post is a follow-up to the Fréchet Centroid on Manifolds article [ref 2]

Principal Components Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique in machine learning that simplifies large datasets into smaller sets while preserving key patterns and trends.

The PCA process can be summarized in five steps [ref 3]:
  1. Standardize the Data: Normalize the range of the original continuous variables.  
  2. Calculate the Covariance Matrix: Identify correlations between variables.  
  3. Determine Eigenvectors and Eigenvalues: Compute these from the covariance matrix to identify the principal components.  
  4. Form the Feature Vector: Select the principal components to retain.
  5. Reproject the Data: Transform the data along the axes of the principal components.
Fig. 1 Visualization of principal components in 3D Euclidean space



Note: These five steps are implemented (and abstracted) in the Scikit-learn Python library within the decomposition.PCA class.

PCA has some serious limitations: 
  • Assumption of linear relationshipPCA only captures linear correlations between features.
  • Sensitivity to scaling: Features with larger ranges dominate the principal components.
  • Sensitivity to noise: Noise and outliers can distort the features set
  • Sensitivity to missing data: It requires some level of amputation
  • Assumption of Gaussian distribution
  • Unsupervised learning: It ignores class labels making unsuitable to classification
  • Assumption of feature importance: All features have the same importance or weights.

PCA for manifolds

The representation of points on the manifold as tangent vectors at a reference point to fit any machine learning algorithm. The principal components analysis is no exception.

The tangent space is a linear approximation of the manifold at a specific point (usually the Fréchet centroid of the data on the manifold [ref 2] ).
By projecting data onto the tangent space, PCA can be applied as if the data were in Euclidean space.

Fig. 2 Visualization of principal components in locally Euclidean tangent plane


PCA on the tangent space provides a way to represent manifold-valued data in a lower-dimensional, Euclidean space for tasks such as visualization, clustering, or regression. It enables the application of statistical analysis tools and tests requiring a Euclidean space.

This approach has numerous benefits
  • It allows handling of non-Euclidean data
  • It linearizes the manifold locally, reducing the computation complexity entailed by a manifold
  • The covariance is intrinsic to the manifold.

Implementation

Components

The goal is to develop a representation and programming framework capable of handling both linear PCA (Euclidean) and locally linear PCA (Manifold).


Let's begin by defining a class, PrincipalComponents, to represent components for both Euclidean spaces and manifolds.
The class attributes include:
  • base_point: A point on the manifold corresponding to the tangent space, locally Euclidean, where the principal components are calculated.
  • Components: A NumPy array representing the principal components.

@dataclass
class PrincipalComponents:
    base_point: np.array
    components: np.array

Next, we define a class, ManifoldPCA, which supports Principal Component Analysis (PCA) for manifolds embedded in a 3D ambient Euclidean space.

The class includes two attributes:
  • space: Represents the space type, which can either inherit from `Manifold` (with a locally Euclidean tangent space at a base point) or be `Euclidean` (globally Euclidean, where the base point can be any point in 3D space).
  • tgt_pca: Specifies the PCA computation class, which can be either Geomstats tangentPCA or Scikit-learn's  PCA class.

The three main methods are:
  • estimate: Computes the principal components for a set of points on the manifold.  
  • project: Projects points on the manifold along the principal components (eigenvector transformation).  
  • geodesics: Visualizes the curves on the hypersphere associated with the principal components.
class ManifoldPCA(object):
    def __init__(self, space: Manifold) -> None:
        from geomstats.learning.pca import TangentPCA
        from geomstats.geometry.euclidean import Euclidean
        from sklearn.decomposition import PCA
 

        self.tgt_pca = TangentPCA(space, n_components=2if not is instance(space, Euclidean) \
                              else PCA(n_components=3)
        self.space = space

    def estimate(self, 
                         manifold_pts: np.array, 
                         base_point: Optional[np.array] = None) -> PrincipalComponents:

    def project(self, manifold_pts: np.array) -> np.array:

    def geodesics(self, principal_components: PrincipalComponents) -> List[np.array]:



Computation

The estimate method delegates the computation of principal components to either the tangent space (__tangent_pca) or the 3D Euclidean space (__euclidean_pca), depending on the context. It uses the Geomstats TangentPCA.fit [ref 4method for manifolds or the Scikit-learn PCA.fit [ref 5] method for Euclidean spaces.

In this article we use the Hypersphere [ref 6] as our test manifold.

For PCA in the tangent space, the Fréchet centroid is used as the base point if none is provided. In the 3D Euclidean space, the arithmetic mean serves as the base point.

As expected, PCA in the Euclidean space produces three-dimensional principal components, whereas PCA in the locally Euclidean tangent space results in two-dimensional principal components.

def estimate(self,
                     manifold_pts: np.array,
                     base_point: Optional[np.array] = None) -> PrincipalComponents:
    from geomstats.geometry.euclidean import Euclidean

    return self .__tangent_pca(manifold_pts, base_point) if not isinstance(self.space, Euclidean)  \  \
              else self.__euclidean_pca(manifold_pts, base_point)



"""  Compute the principal components on the tangent of the manifold """

def __tangent_pca(self,
                               manifold_pts: np.array,
                               base_point: Optional[np.array] = None) -> PrincipalComponents:
        from geomstats.learning.frechet_mean import FrechetMean

        # If no base points have been defined ... then use the centroid
        if base_point is None:
            frechet_mean = FrechetMean(self.space)
            frechet_mean.fit(X=manifold_pts, y=None)
            base_point = frechet_mean.estimate_

        # Compute the 2-dimension principal components in intrinsic coordinates
        self.pca.fit(manifold_pts, base_point=base_point)
        return PrincipalComponents(base_point, self.pca.components_)



"""  Compute the principal components on the Euclidean space """

def __euclidean_pca(self,
                                  manifold_pts: np.array,
                                  base_point: Optional[np.array] = None) -> PrincipalComponents:
        
        self.pca.fit(manifold_pts)
        mean = np.mean(manifold_pts, axis=0) if base_point is None else base_point
        
        return PrincipalComponents(mean, self.pca.components_)


The principal components are directions and constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. The method project redefine the values of the data points in relation of the new orthonormal basis.

def project(self, manifold_pts: np.array) -> np.array:
    return self.tgt_pca.transform(manifold_pts)

The geodesics corresponding to each orthonormal basis are generated for visualization purposes.

def geodesics(self, principal_components PrincipalComponents) -> List[np.array]:
     import geomstats.backend as gs

     # For now, we generate geodesics only for spheres
     if not isinstance(self.space, Hypersphere):
          raise GeometricException(f'Cannot extract geodesics from this type of manifold {str(self.space)}')

     # Extract the geodesic for each of the principal components
     base_point = principal_components.base_point
     geodesics = [self.space.metric.geodesic(initial_point=base_point, initial_tangent_vec=component)
                          for component in principal_components.omponents]
        
     # Generate a trace of 200 data points for visualization purpose
     trace = gs.linspace(-1.0, 1.0, 200)
     geodesic_traces = [geodesic(trace) for geodesic in geodesics]

     return geodesic_traces


Evaluation

The goal of this evaluation is to compute and visualize the principal components for a set of data points on a hypersphere [ref ], both:
  • On the tangent plane at the Fréchet mean using Geomstats.
  • In the surrounding 3D Euclidean space using Scikit-learn.
The steps are explained in the comments within the following code snippet.

# 1. Generate random points on the hypersphere
sphere = Hypersphere(dim=2)
X = sphere.random_uniform(n_samples=8)

# 2. Perform principal components analysis on tangent plane of the hypersphere
manifold_pca = ManifoldPCA(sphere)
principal_components = manifold_pca.estimate(X)
print(f'\nPrincipal components for Hypersphere:\n{principal_components.components}')

# 3. Project the feature set on the 2D tangent plane
projected_points = manifold_pca.project(X)
print(f'\nProjected points on tangent space:\n{projected_points}')

# 4. Create and visualize the geodesics associated to the principal components
geodesic_components = manifold_pca.geodesics(principal_components)

# 5. Visualization 
hypersphere_plot = HyperspherePlot(X, principal_components.base_point)
hypersphere_plot.show(geodesic_components)

# 6. Apply linear PCA to the data points on hypershoer
euclidean_pca = ManifoldPCA(Euclidean(3))
principal_components = euclidean_pca.estimate(X)
print(f'\nPrincipal components in Euclidean space:\n{principal_components.components}')

# 7. Project the features set on the Euclidean space
projected_points = euclidean_pca.project(X)
print(f'\nProjected points in Euclidean space:\n{projected_points}')


PCA for hypersphere data on tangent space

Output:
Principal components for Hypersphere:
[[ 0.37805 -0.21186 -0.90121]
 [ 0.74583  0.64640  0.16091]]

Projected points on tangent space:
[[ 0.28557 -0.58902]
 [-0.98468  0.58031]
 [-0.69088  0.13017]
 [-1.16979 -0.79723]
 [-0.49557 -0.04852]
 [-0.08720  0.82096]
 [ 1.64977  0.56592]
 [ 1.49278 -0.66260]]

The following 3D plot shows the 8 random points on the hypersphere, along with the two geodesics corresponding to the first and second principal components on the tangent plane, as well as the Fréchet centroid.

Fig. 3 Visualization of random points on a 3D sphere, along with 
geodesics PCA and Fréchet mean


The manifold points are projected on the tangent plane then transformed along the two orthonormal basis.

Fig. 4 Visualization of hypersphere data point projected along the 2 orthonormal bases 
on the tangent plane


PCA for hypersphere data on Euclidean space

Principal components in Euclidean space:
[[-0.07085  0.00221  0.99742 ]
 [ 0.79399  0.60542  0.05505]
 [-0.60377  0.79589 -0.04465]]

Projected points in Euclidean space:
[[-0.25303 -0.50401 -0.49961]
 [ 0.68800  0.36335  0.29347]
 [ 0.64130  0.04860 -0.10587]
 [ 0.51994 -0.65668  0.48229]
 [ 0.51764 -0.09482 -0.28160]
 [ 0.15862  0.71983 -0.18580]
 [-1.12479  0.42143  0.22098]
 [-1.14769 -0.29770  0.07614 ]]


This final 3D plot illustrates the original 8 data points on the hypersphere, transformed along the three orthonormal bases in Euclidean space.

Fig. 5 Visualization of hypersphere data point projected along the 3 orthonormal bases 
in the Euclidean space


Dimension reduction alternatives

There are several alternative algorithms dedicated to non-linear dimension reduction among them:

Kernel Principal Component Analysis (Kernel PCA) extends Principal Component Analysis (PCA) to handle data that is not linearly separable. By applying a kernel function to map the data into a higher-dimensional space, Kernel PCA reveals complex underlying structures in the data [ref 7].

t-SNE is an unsupervised, non-linear technique designed for exploring and visualizing high-dimensional data. It provides insights into how data is structured in higher dimensions by projecting it into two or three dimensions [ref 8].

Uniform Manifold Approximation and Projection (UMAP): leverages research done for Laplacian eigenmaps to tackle the problem of uniform data distributions on manifolds by combining Riemannian geometry with advances in fuzzy simplicial sets [ref 9].

Locally Linear Embedding (LLE) aims to create a lower-dimensional representation of data while preserving the distances within local neighborhoods. It can be viewed as performing a series of local Principal Component Analyses, which are then combined globally to determine the optimal non-linear embedding [ref 10].


References






-------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3 and 
Geometric Learning in Python Newsletter on LinkedIn.

Tuesday, December 10, 2024

Fréchet Centroid on Manifolds in Python

  Target audience: Intermediate
Estimated reading time: 5'


The Fréchet centroid (or intrinsic centroid) is a generalization of the concept of a mean to data points that lie on a manifold or in a non-Euclidean space. It minimizes a similar quantity defined using the intrinsic geometry of the manifold.


Follow me on LinkedIn

What you will learn: How to compute the Frechet centroid (or mean) of multiple point on a data manifold.

Notes

  • Environments: Python 3.12.5,  GeomStats 2.8.0, Matplotlib 3.9, Numpy 2.2.0
  • Source code is available on GitHub [ref 1]
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statement.


Introduction

For readers unfamiliar with manifolds and basic differential geometry, I highly recommend starting with my two introductory posts:

  1. Foundation of Geometric Learning This post introduces differential geometry in the context of machine learning, outlining its foundational components.
  2. Differentiable Manifolds for Geometric Learning: This post explores manifold concepts such as tangent vectors and geodesics, with Python implementations for the hypersphere using the Geomstats library.
Alternatively, I strongly encourage readers to consult tutorials [ref 2], video series [ref 3], or publications [ref 4, 5] for more in-depth information.

The following image illustrates the difference between the Euclidean and  Fréchet  means given a manifold.


Fig. 1. Illustration of Euclidean and Frechet means on an arbitrary manifold


Let's consider n tensors x[i] with d values, the Euclidean mean is computed as
\[\mathbf{\mu} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_{i} \] with for each  coordinate/value of index j: \[\mu^{(j)}=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{(j)} \] 

For a manifold M and a metric-distance d the  Fréchet centroid is defined as \[ m=arg\displaystyle \min_{p \in M}\sum_{i=1}^{n}d^{2}\left(p, \mathbf{x_{i}} \right) \] and the weighted Frechet mean as: \[m=arg\displaystyle \min_{p \in M}\sum_{i=1}^{n}w_{i}d^{2}\left(p, \mathbf{x_{i}} \right) \]


Implementation

Let's explore the computation of means from the perspective of extrinsic geometry, focusing on surfaces embedded within a three-dimensional Euclidean space.
To begin, we'll define a class `FrechetEstimator` to encapsulate the essential components of the estimator:
  • space: A smooth manifold (e.g., Sphere, Hyperbolic) or a Lie group (e.g., SO3, SE3).  
  • optimizer: An optimization algorithm used to iterate through the space and minimize the sum of squared distances.  
  • weights: Optional weights that can be applied during the mean estimation process.

Note: The constructor invoke the Geomstats class FrechetMean associated to the given manifold


class FrechetEstimator(object):
    def __init__(self, space: Manifold, optimizer: BaseGradientDescent, weights: Tensor = None) -> None:
        self.frechet_mean = FrechetMean(space)
        self.frechet_mean.optimizer = optimizer
        self.weights = weights
        self.space = space

    def estimate(self, X: List[np.array]) -> np.array:
    
    def rand(self, num_samples: int) -> List[np.array]:


The `rand` method generates random points on the manifold for evaluation purposes, utilizing the `random_uniform` function from the Geomstats library.

The implementation ensures that:
- The manifold type is supported.  
- All randomly generated points (as tensors) are verified to belong to the manifold.

def rand(self, num_samples: int) -> np.array:
    from geomstats.geometry.hypersphere import Hypersphere
    from geomstats.geometry.special_orthogonal import _SpecialOrthogonalMatrices

    # Test if this manifold is supported
    if not (isinstance(self.space, Hypersphere) or 
              isinstance(self.space, _SpecialOrthogonalMatrices)):
        raise GeometricException('Cannot generate random values on unsupported manifold')

    X = self.space.random_uniform(num_samples)
    # Validate the randomly generated belongs to the manifold 'self.space'
    are_points_valid = all([self.space.belongs(x) for x in X])
    if not are_points_valid:
        raise GeometricException('Some generated points do not belong to the manifold')
    return X

Our evaluation uses the Euclidean mean as a baseline, with a straightforward implementation using NumPy.


def euclidean_mean(manifold_points: List[Tensor]) -> np.array:
      return np.mean(manifold_points, axis=0)


The computation of the Fréchet centroid on a sequence of data points defined as stacked numpy arrays, is implemented in method, estimate. It relies on the Geomstats method FrechetMean.fit

def estimate(self, X: np.array) -> np.array:
    self.frechet_mean.fit(X=X, y=None, weights=self.weights)
    return self.frechet_mean.estimate_

Evaluation

Centroid on Sphere (S2)

Our initial test involves calculating the non-weighted Fréchet centroid for 7 points located on a hypersphere [ref 6]. The randomly generated points, `rand_points`, are visualized both on a 3D sphere and in Euclidean space using the `HyperspherPlot` and `EuclideanPlot` classes. While the code is not included here for clarity, it is available on GitHub [ref 1].


frechet_estimator = FrechetEstimator(space=Hypersphere(dim=2),
                                                            optimizer=GradientDescent(), 
                                                            weights=None)
# 1- Generate the random point on the hypersphere
rand_points = frechet_estimator.rand(8)

# 2- Estimate the Frechet centroid then test if belongs to the manifold
frechet_mean = frechet_estimator.estimate(rand_points)
assert frechet_estimator.space.belongs(frechet_mean)
# 3- Display the points on the Hypersphere on plot hypersphere_plot = HyperspherePlot(rand_points,  
frechet_mean)
hypersphere_plot.show()

# 4- Compute the euclidean or arithmetic centroid
euclidean_mean = FrechetEstimator.euclidean_mean(np_points)

# 5- Display the points in 3D Euclidean space
euclidean_plot = EuclideanPlot(rand_points, euclidean_mean
euclidean_plot.show()

print(f'\nFrechet mean:   {frechet_mean}\nEuclidean mean: {euclidean_mean}')


Output:

Frechet mean:     [ 0.0174    -0.9958   -0.0897]

Euclidean mean: [ 0.0220    -0.0792     0.0226]


The 8 random points are visualized using Matplotlib 3D scatter plot.


'

Fig 2. Visualization of points from a 2-dimension hypersphere on 3D Euclidean space


The next graph visualizes the 8 random points on a 3D Sphere.

Fig 3. Visualization of random points on a 2-dimension hypersphere



Centroid on Special Orthogonal Group (SO3)

A special Orthogonal Group is a Lie group. In differential geometry, a Lie group is a mathematical structure that combines the properties of both a group and a smooth manifold. It allows for the application of both algebraic and geometric techniques. As a group, it has an operation (like multiplication) that satisfies certain axioms (closure, associativity, identity, and invertibility).
The Special Orthogonal Group in 3 dimensions, SO(3) is the group of all rotation matrices in 3 spatial dimensions.
It can be defined by 3 rotation elements for each of the axis of rotation x, y, and z.

Important note: The SO(3) manifold is described and evaluated with a Python source code in a previous post [ref 7].


We utilize the Geomstats `SpecialOrthogonal` class to generate 4 random matrices, then compute their Euclidean and Fréchet centroids.

manifold = SpecialOrthogonal(n=3, point_type="matrix")
frechet_estimator = FrechetEstimator(manifold, GradientDescent(), weights=None)

# Generate the random point on SO3
manifold_points = frechet_estimator.rand(4)

# Visualize the SO3 matrices on 3D plot
so3_plot = SO3Plot(manifold_points) so3_plot.show()

# Compute the Frechet and Euclidean centroids
frechet_mean = frechet_estimator.estimate(manifold_points) euclidean_mean = FrechetEstimator.euclidean_mean(manifold_points) print(f'\nFrechet mean:\n{frechet_mean}\nEuclidean mean:\n{euclidean_mean}')

Output:
Frechet mean:   
[[-0.74841 -0.40675 -0.52385]
 [-0.01445 -0.77966  0.62603]
 [-0.66306  0.47610  0.57763]]
Euclidean mean: 
[[-0.52282 -0.24432 -0.38322]
 [ 0.03642 -0.55841  0.37230]
 [-0.44746  0.26312  0.11775]]

In this scenario we represent 4 asymmetric 3x3 matrices representing the SO(3) rotations on a 3D plots.

Fig 4. Visualization of 4 random SO3 matrices on a 3D plot

References

[4Vector and Tensor Analysis with Applications -  A. I. Borisenko, I. E. Tarapov - Dover Books on MathenaticsPublications 1979 
[5A Student's Guide to Vectors and Tensors - D. Fleisch - Cambridge University Press - 2008





-------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3 and 
Geometric Learning in Python Newsletter on LinkedIn.