Showing posts with label PCA. Show all posts
Showing posts with label PCA. Show all posts

Sunday, December 15, 2024

Approximating PCA on Manifolds

  Target audience: Advanced
Estimated reading time: 5'

 

Have you ever wondered how to perform Principal Component Analysis on manifolds? An approximate solution relies on the fact that tangent spaces are locally Euclidean.

Table of contents

       Principal components analysis (PCA)
       Components
       Computation
Follow me on LinkedIn

What you will learn: How to approximate PCA on a manifold using Eigen decomposition (orthonormal basis) on the tangent space.

Notes

  • Environments: Python 3.12.5,  GeomStats 2.8.0, Matplotlib 3.9, Numpy 2.2.0, Scikit-learn 1.6.0 
  • Source code is available on GitHub [ref 1]
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statement.

Introduction

This post is a follow-up to the Fréchet Centroid on Manifolds article [ref 2]

Principal Components Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique in machine learning that simplifies large datasets into smaller sets while preserving key patterns and trends.

The PCA process can be summarized in five steps [ref 3]:
  1. Standardize the Data: Normalize the range of the original continuous variables.  
  2. Calculate the Covariance Matrix: Identify correlations between variables.  
  3. Determine Eigenvectors and Eigenvalues: Compute these from the covariance matrix to identify the principal components.  
  4. Form the Feature Vector: Select the principal components to retain.
  5. Reproject the Data: Transform the data along the axes of the principal components.
Fig. 1 Visualization of principal components in 3D Euclidean space



Note: These five steps are implemented (and abstracted) in the Scikit-learn Python library within the decomposition.PCA class.

PCA has some serious limitations: 
  • Assumption of linear relationshipPCA only captures linear correlations between features.
  • Sensitivity to scaling: Features with larger ranges dominate the principal components.
  • Sensitivity to noise: Noise and outliers can distort the features set
  • Sensitivity to missing data: It requires some level of amputation
  • Assumption of Gaussian distribution
  • Unsupervised learning: It ignores class labels making unsuitable to classification
  • Assumption of feature importance: All features have the same importance or weights.

PCA for manifolds

The representation of points on the manifold as tangent vectors at a reference point to fit any machine learning algorithm. The principal components analysis is no exception.

The tangent space is a linear approximation of the manifold at a specific point (usually the Fréchet centroid of the data on the manifold [ref 2] ).
By projecting data onto the tangent space, PCA can be applied as if the data were in Euclidean space.

Fig. 2 Visualization of principal components in locally Euclidean tangent plane


PCA on the tangent space provides a way to represent manifold-valued data in a lower-dimensional, Euclidean space for tasks such as visualization, clustering, or regression. It enables the application of statistical analysis tools and tests requiring a Euclidean space.

This approach has numerous benefits
  • It allows handling of non-Euclidean data
  • It linearizes the manifold locally, reducing the computation complexity entailed by a manifold
  • The covariance is intrinsic to the manifold.

Implementation

Components

The goal is to develop a representation and programming framework capable of handling both linear PCA (Euclidean) and locally linear PCA (Manifold).


Let's begin by defining a class, PrincipalComponents, to represent components for both Euclidean spaces and manifolds.
The class attributes include:
  • base_point: A point on the manifold corresponding to the tangent space, locally Euclidean, where the principal components are calculated.
  • Components: A NumPy array representing the principal components.

@dataclass
class PrincipalComponents:
    base_point: np.array
    components: np.array

Next, we define a class, ManifoldPCA, which supports Principal Component Analysis (PCA) for manifolds embedded in a 3D ambient Euclidean space.

The class includes two attributes:
  • space: Represents the space type, which can either inherit from `Manifold` (with a locally Euclidean tangent space at a base point) or be `Euclidean` (globally Euclidean, where the base point can be any point in 3D space).
  • tgt_pca: Specifies the PCA computation class, which can be either Geomstats tangentPCA or Scikit-learn's  PCA class.

The three main methods are:
  • estimate: Computes the principal components for a set of points on the manifold.  
  • project: Projects points on the manifold along the principal components (eigenvector transformation).  
  • geodesics: Visualizes the curves on the hypersphere associated with the principal components.
class ManifoldPCA(object):
    def __init__(self, space: Manifold) -> None:
        from geomstats.learning.pca import TangentPCA
        from geomstats.geometry.euclidean import Euclidean
        from sklearn.decomposition import PCA
 

        self.tgt_pca = TangentPCA(space, n_components=2if not is instance(space, Euclidean) \
                              else PCA(n_components=3)
        self.space = space

    def estimate(self, 
                         manifold_pts: np.array, 
                         base_point: Optional[np.array] = None) -> PrincipalComponents:

    def project(self, manifold_pts: np.array) -> np.array:

    def geodesics(self, principal_components: PrincipalComponents) -> List[np.array]:



Computation

The estimate method delegates the computation of principal components to either the tangent space (__tangent_pca) or the 3D Euclidean space (__euclidean_pca), depending on the context. It uses the Geomstats TangentPCA.fit [ref 4method for manifolds or the Scikit-learn PCA.fit [ref 5] method for Euclidean spaces.

In this article we use the Hypersphere [ref 6] as our test manifold.

For PCA in the tangent space, the Fréchet centroid is used as the base point if none is provided. In the 3D Euclidean space, the arithmetic mean serves as the base point.

As expected, PCA in the Euclidean space produces three-dimensional principal components, whereas PCA in the locally Euclidean tangent space results in two-dimensional principal components.

def estimate(self,
                     manifold_pts: np.array,
                     base_point: Optional[np.array] = None) -> PrincipalComponents:
    from geomstats.geometry.euclidean import Euclidean

    return self .__tangent_pca(manifold_pts, base_point) if not isinstance(self.space, Euclidean)  \  \
              else self.__euclidean_pca(manifold_pts, base_point)



"""  Compute the principal components on the tangent of the manifold """

def __tangent_pca(self,
                               manifold_pts: np.array,
                               base_point: Optional[np.array] = None) -> PrincipalComponents:
        from geomstats.learning.frechet_mean import FrechetMean

        # If no base points have been defined ... then use the centroid
        if base_point is None:
            frechet_mean = FrechetMean(self.space)
            frechet_mean.fit(X=manifold_pts, y=None)
            base_point = frechet_mean.estimate_

        # Compute the 2-dimension principal components in intrinsic coordinates
        self.pca.fit(manifold_pts, base_point=base_point)
        return PrincipalComponents(base_point, self.pca.components_)



"""  Compute the principal components on the Euclidean space """

def __euclidean_pca(self,
                                  manifold_pts: np.array,
                                  base_point: Optional[np.array] = None) -> PrincipalComponents:
        
        self.pca.fit(manifold_pts)
        mean = np.mean(manifold_pts, axis=0) if base_point is None else base_point
        
        return PrincipalComponents(mean, self.pca.components_)


The principal components are directions and constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. The method project redefine the values of the data points in relation of the new orthonormal basis.

def project(self, manifold_pts: np.array) -> np.array:
    return self.tgt_pca.transform(manifold_pts)

The geodesics corresponding to each orthonormal basis are generated for visualization purposes.

def geodesics(self, principal_components PrincipalComponents) -> List[np.array]:
     import geomstats.backend as gs

     # For now, we generate geodesics only for spheres
     if not isinstance(self.space, Hypersphere):
          raise GeometricException(f'Cannot extract geodesics from this type of manifold {str(self.space)}')

     # Extract the geodesic for each of the principal components
     base_point = principal_components.base_point
     geodesics = [self.space.metric.geodesic(initial_point=base_point, initial_tangent_vec=component)
                          for component in principal_components.omponents]
        
     # Generate a trace of 200 data points for visualization purpose
     trace = gs.linspace(-1.0, 1.0, 200)
     geodesic_traces = [geodesic(trace) for geodesic in geodesics]

     return geodesic_traces


Evaluation

The goal of this evaluation is to compute and visualize the principal components for a set of data points on a hypersphere [ref ], both:
  • On the tangent plane at the Fréchet mean using Geomstats.
  • In the surrounding 3D Euclidean space using Scikit-learn.
The steps are explained in the comments within the following code snippet.

# 1. Generate random points on the hypersphere
sphere = Hypersphere(dim=2)
X = sphere.random_uniform(n_samples=8)

# 2. Perform principal components analysis on tangent plane of the hypersphere
manifold_pca = ManifoldPCA(sphere)
principal_components = manifold_pca.estimate(X)
print(f'\nPrincipal components for Hypersphere:\n{principal_components.components}')

# 3. Project the feature set on the 2D tangent plane
projected_points = manifold_pca.project(X)
print(f'\nProjected points on tangent space:\n{projected_points}')

# 4. Create and visualize the geodesics associated to the principal components
geodesic_components = manifold_pca.geodesics(principal_components)

# 5. Visualization 
hypersphere_plot = HyperspherePlot(X, principal_components.base_point)
hypersphere_plot.show(geodesic_components)

# 6. Apply linear PCA to the data points on hypershoer
euclidean_pca = ManifoldPCA(Euclidean(3))
principal_components = euclidean_pca.estimate(X)
print(f'\nPrincipal components in Euclidean space:\n{principal_components.components}')

# 7. Project the features set on the Euclidean space
projected_points = euclidean_pca.project(X)
print(f'\nProjected points in Euclidean space:\n{projected_points}')


PCA for hypersphere data on tangent space

Output:
Principal components for Hypersphere:
[[ 0.37805 -0.21186 -0.90121]
 [ 0.74583  0.64640  0.16091]]

Projected points on tangent space:
[[ 0.28557 -0.58902]
 [-0.98468  0.58031]
 [-0.69088  0.13017]
 [-1.16979 -0.79723]
 [-0.49557 -0.04852]
 [-0.08720  0.82096]
 [ 1.64977  0.56592]
 [ 1.49278 -0.66260]]

The following 3D plot shows the 8 random points on the hypersphere, along with the two geodesics corresponding to the first and second principal components on the tangent plane, as well as the Fréchet centroid.

Fig. 3 Visualization of random points on a 3D sphere, along with 
geodesics PCA and Fréchet mean


The manifold points are projected on the tangent plane then transformed along the two orthonormal basis.

Fig. 4 Visualization of hypersphere data point projected along the 2 orthonormal bases 
on the tangent plane


PCA for hypersphere data on Euclidean space

Principal components in Euclidean space:
[[-0.07085  0.00221  0.99742 ]
 [ 0.79399  0.60542  0.05505]
 [-0.60377  0.79589 -0.04465]]

Projected points in Euclidean space:
[[-0.25303 -0.50401 -0.49961]
 [ 0.68800  0.36335  0.29347]
 [ 0.64130  0.04860 -0.10587]
 [ 0.51994 -0.65668  0.48229]
 [ 0.51764 -0.09482 -0.28160]
 [ 0.15862  0.71983 -0.18580]
 [-1.12479  0.42143  0.22098]
 [-1.14769 -0.29770  0.07614 ]]


This final 3D plot illustrates the original 8 data points on the hypersphere, transformed along the three orthonormal bases in Euclidean space.

Fig. 5 Visualization of hypersphere data point projected along the 3 orthonormal bases 
in the Euclidean space


Dimension reduction alternatives

There are several alternative algorithms dedicated to non-linear dimension reduction among them:

Kernel Principal Component Analysis (Kernel PCA) extends Principal Component Analysis (PCA) to handle data that is not linearly separable. By applying a kernel function to map the data into a higher-dimensional space, Kernel PCA reveals complex underlying structures in the data [ref 7].

t-SNE is an unsupervised, non-linear technique designed for exploring and visualizing high-dimensional data. It provides insights into how data is structured in higher dimensions by projecting it into two or three dimensions [ref 8].

Uniform Manifold Approximation and Projection (UMAP): leverages research done for Laplacian eigenmaps to tackle the problem of uniform data distributions on manifolds by combining Riemannian geometry with advances in fuzzy simplicial sets [ref 9].

Locally Linear Embedding (LLE) aims to create a lower-dimensional representation of data while preserving the distances within local neighborhoods. It can be viewed as performing a series of local Principal Component Analyses, which are then combined globally to determine the optimal non-linear embedding [ref 10].


References






-------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3 and 
Geometric Learning in Python Newsletter on LinkedIn.

Wednesday, June 12, 2024

Principal Geodesic Analysis

Target audience: Advanced
Estimated reading time: 7'

Principal Component Analysis (PCA) is essential for dimensionality and noise reduction, feature extraction, and anomaly detection. However, its effectiveness is limited by the strong assumption of linearity. 
Principal Geodesic Analysis (PGA) addresses this limitation by extending PCA to handle non-linear data that lies on a lower-dimensional manifold.


Table of contents
       Tangent PCA
       Setup
       Euclidean PCA
Follow me on LinkedIn

What you will learn: How to implement principal geodesic analysis as extension of principal components analysis on a manifolds tangent space.

Notes

  • Environments: Python  3.10.10, Geomstats 2.7.0, Scikit-learn 1.4.2
  • This article assumes that the reader is somewhat familiar with differential and tensor calculus [ref 1]. Please refer to our previous articles related to geometric learning [ref 234].
  • Source code is available at  Github.com/patnicolas/Data_Exploration/manifolds
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Introduction

This article is the ninth installments of our ongoing series focused on geometric learning. As with previous articles, we utilize the Geomstats Python library [ref5] to implement concepts associated with geometric learning.

NoteSummaries of my earlier articles on this topic can be found in the Appendix

As a reminder, the primary goal of learning Riemannian geometry is to understand and analyze the properties of curved spaces that cannot be described adequately using Euclidean geometry alone. 

This article revisits the widely used unsupervised learning technique, Principal Component Analysis (PCA), and its counterpart in non-Euclidean space, Principal Geodesic Analysis (PGA).
The content of this article is as follows:
  1. Brief recap of PCA
  2. Overview of key components of differential geometry
  3. Introduction to PCA on tangent space using the logarithmic map
  4. Implementation in Python using the Geomstats library

   Principal components

    Principal component analysis

Principal Component Analysis (PCA) is a technique for reducing the dimensionality of a large dataset, simplifying it into a smaller set of components while preserving important patterns and trends. The goal is to reduce the number of variables of a data set, while preserving as much information as possible.

For a n-dimensional data, PCA tries to put maximum possible information in the first component c0, then maximum remaining information in the second c1 and so on, until having something like shown in the scree plot below.
Finally, we select the p << n top first components such as:\[\left \{ c_{i} :\ \ \sum_{i=1}^{p} c_{i} < T \ and \ \ c_{1} \geqslant c_{2} .... \geqslant c_{p} \right \}\]
The principal components are actually the eigenvectors of the Covariance matrix.
The first thing to understand about eigenvectors and eigenvalues is that they always appear in pairs, with each eigenvector corresponding to an eigenvalue. Additionally, the number of these pairs matches the number of dimensions in the data.

For instance, in a 3-dimension space, the eigenvalues are extracted from the following 3x3 symmetric Covariance matrix:\[\begin{bmatrix} cov(x, x)) & cov(x,y) & cov(x, z)\\ cov(x, y) & cov(y,y) & cov(y,z) \\ cov(x, z) & cov(y, z) & cov(z, z) \end{bmatrix}\] as cov(a, b) = cov(b, a).
The eigenvectors of the Covariance matrix (principal components) are the directions of the axes where there is the most variance (most information).
Assuming n normalized data points xi, the first principal component (with most significant eigenvalue) is defined as: \[P_{1}= arg\max_{||p||=1} \sum_{i=1}^{n} \left ( p.x_{i} \right )^{2} \ \ \ (1)\]where '.' is the dot product.


Differential geometry

Extending principal components to differentiable manifolds requires basic knowledge of differential and Riemann geometry introduced in previous articles [ref  2, 3, 4].

Smooth manifold
A smooth (or differentiable) manifold is a topological manifold equipped with a globally defined differential structure. Locally, any topological manifold can be endowed with a differential structure by applying the homeomorphisms from its atlas and the standard differential structure of a vector space.

Tangent space
At every point P on a differentiable manifold, one can associate a tangent space, which is a real vector space that intuitively encompasses all the possible directions in which one can move tangentially through P. The elements within this tangent space at P are referred to as the tangent vectors, tgt_vector at P
This concept generalizes the idea of a vector originating from a specific point in Euclidean space. For a connected manifold, the dimension of the tangent space at any point is identical to the dimension of the manifold itself.

Geodesics
Geodesics are curves on a surface that only turn as necessary to remain on the surface, without deviating sideways. They generalize the concept of a "straight line" from a plane to a surface, representing the shortest path between two points on that surface.
Mathematically, a curve c(t) on a surface S is a geodesic if at each point c(t), the acceleration is zero or parallel to the normal vector:\[\frac{d^{2}}{dt^{2}}c(t) = 0 \ \ or \ \ \frac{d^{2}}{dt^{2}}c(t).\vec{n}=\frac{d^{2}}{dt^{2}}c(t)\]
Fig. 1 Illustration of a tangent vector and geodesic on a sphere

Logarithmic map
Given two points P and Q on a manifold, the vector on the tangent space v from P to Q is defined as: \[\left \| log_{P}(Q) \right \| = \left \| v \right \|\]

Tangent PCA

On a manifold, tangent spaces (or plane) are local euclidean space for which PCA can be computed. The purpose of Principal Geodesic Analysis is to project the principal components on the geodesic using the logarithmic map (inverse exponential map).
Given a mean m of n data points x[i] on a manifold with a set of geodesics at m, geod, the first principal component on geodesics is defined as:\[\begin{matrix} P_{1}=arg\max_{\left \| v \right \| = 1}\sum_{i=1}^{n}\left \langle v.log_{m}\pi _{geod}(x_{i}) \right \rangle^{2} \\ \pi_{geod}(x_{i}) = arg\max_{g \in geod} \left \| log_{m}(x_{i})-log_{m}(g) \right \|^2 \end{matrix}\]
Fig. 2 Illustration of a tangent vector and geodesic on a sphere

The mean point on the data point x[I] on the manifold can be defined as either a barycenter or a Frechet Mean [ref 6]. Our implementation in the next section relies on the Frechet mean.

  Implementation

For the sake of simplicity, we illustrate the concept of applying PCA on manifold geodesics using a simple manifold, Hypersphere we introduced in a previous article, Differentiable Manifolds for Geometric Learning: Hypersphere

    Setup

Let's encapsulate the evaluation of principal geodesics analysis in a class HyperspherePCA. We leverage the class HyperphereSpace [ref 7] its implementation of random generation of data points on the sphere.

ff from manifolds.hyperspherespace import HypersphereSpace
I   import numpy as np

  
    class HyperspherePCA(object):
  def __init__(self):
     self.hypersphere_space = HypersphereSpace(equip=True)

  def sample(self, num_samples: int) -> np.array:
     return self.hypersphere_space.sample(num_samples)


    Euclidean PCA

First let's implement the traditional PCA algorithm for 3 dimension (features) data set using scikit-learn library with two methods:
  • euclidean_pca_components to extract the 3 eigenvectors along the 3 eigenvalues
  • euclidean_pca_transform to project the evaluation data onto 3 directions defined by the eigenvectors.
    from sklearn.decomposition import PCA

   
    @staticmethod
    def euclidean_pca_components(data: np.array) -> (np.array, np.array):
   num_components = 3
   pca = PCA(num_components)
   pca.fit(data)
   return (pca.singular_values_, pca.components_)

    @staticmethod
    def euclidean_pca_transform(data: np.array) -> np.array:
   num_components = 3
   pca = PCA(num_components)
 
        return pca.fit_transform(data

We compute the principal components on 256 random 3D data points on the hypersphere.

nu num_samples = 256
pca_hypersphere = HyperspherePCA()

# Random data on Hypersphere
data = pca_hypersphere.sample(num_samples)
eigenvalues, components = HyperspherePCA.euclidean_pca_components(data)
transformed = pca_hypersphere.euclidean_pca_transform(data)        

print(f'\nPrincipal components:\n{components}\nEigen values: {eigenvalues')
print(f'\nTransformed data:\n{transformed}')

Output
Principal components:
[[ 0.63124842 -0.7752775  -0.02168467]
 [ 0.68247094  0.54196625  0.49041411]
 [ 0.36845467  0.32437229 -0.8712197 ]]
Eigen values: [9.84728751 9.0183337  8.65835924]

Important note: The 3 eigenvalues are similar because the input data is random.

    Tangent PCA on hypersphere 

The private method __tangent_pca computes principal components on the tangent plane using the logarithmic map. As detailed in the previous section, this implementation employs the Frechet mean as the base point on the manifold, which is the argument of the Geomstats method fit.

The method tangent_pca_components extracts the principal components computed using the logarithmic map. Finally the method tangent_pca_transform projects input data along the 3 principal components, similarly to its Euclidean counterpart, euclidean_pca_transform .

   from geomstats.learning.pca import TangentPCA
from geomstats.learning.frechet_mean import FrechetMean


def tangent_pca_components(self, data: np.array) -> np.array:
    tgt_pca = self.__tangent_pca(data)
    return tgt_pca.components_

def tangent_pca_transform(self, data: np.array) -> np.array:
    tgt_pca = self.__tangent_pca(data)
    tgt_pca.transform(data)

    
def __tangent_pca(self, data: np.array) -> TangentPCA:
     sphere = self.hypersphere_space.space
     tgt_pca = TangentPCA(sphere)
       
          # We use the Frechet mean as center of data points on the hypersphere
     mean = FrechetMean(sphere, method="default")
     mean.fit(data)
          # Frechet mean estimate
     estimate = mean.estimate_
        
          # Invoke Geomstats fitting method for PCA on tangent plane
          tgt_pca.fit(data, base_point=estimate)
     return tgt_pca

    We evaluate the principal geodesic components on hypersphere using the same 256 points randomly generated on the hypersphere and compared with the components on the Euclidean space.

nu  num_samples = 256
pca_hypersphere = HyperspherePCA()
        
# Random data point generated on Hypersphere
data = pca_hypersphere.sample(num_samples)
        
_, components = HyperspherePCA.euclidean_pca_components(data)
tangent_components = pca_hypersphere.tangent_pca_components(data)
print(f'\nEuclidean PCA components:\n{components}\nTangent Space PCA components:\n{tangent_components}')

Output:
Euclidean PCA components:
[[ 0.63124842 -0.7752775  -0.02168467]
 [ 0.68247094  0.54196625  0.49041411]
 [ 0.36845467  0.32437229 -0.8712197 ]]
Tangent Space PCA components:
[[ 0.8202982  -0.45814582  0.34236423]
 [ 0.45806882  0.16783651 -0.87292832]
 [-0.34246724 -0.87288792 -0.34753831]]

References



--------------------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3 
and Geometric Learning in Python Newsletter on LinkedIn.

Appendix

Here is the list of published articles related to geometric learning: