Showing posts with label scikit-learn. Show all posts

Sunday, December 15, 2024

Approximating PCA on Manifolds

Target audience: Advanced

Estimated reading time: 5'

Posts history

Newsletter: Geometric Learning in Python

Have you ever wondered how to perform Principal Component Analysis on manifolds? An approximate solution relies on the fact that tangent spaces are locally Euclidean.

Introduction

Principal components analysis (PCA)

PCA for manifolds

Implementation

Components

Computation

Evaluation

Dimension reduction alternatives

References

What you will learn: How to approximate PCA on a manifold using Eigen decomposition (orthonormal basis) on the tangent space.

Notes:

Environments: Python 3.12.5, GeomStats 2.8.0, Matplotlib 3.9, Numpy 2.2.0, Scikit-learn 1.6.0
Source code is available on GitHub [ref 1]
To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statement.

Introduction

This post is a follow-up to the Fréchet Centroid on Manifolds article [ref 2]

Principal Components Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique in machine learning that simplifies large datasets into smaller sets while preserving key patterns and trends.

The PCA process can be summarized in five steps [ref 3]:

Standardize the Data: Normalize the range of the original continuous variables.
Calculate the Covariance Matrix: Identify correlations between variables.
Determine Eigenvectors and Eigenvalues: Compute these from the covariance matrix to identify the principal components.
Form the Feature Vector: Select the principal components to retain.
Reproject the Data: Transform the data along the axes of the principal components.

Fig. 1 Visualization of principal components in 3D Euclidean space

Note: These five steps are implemented (and abstracted) in the Scikit-learn Python library within the decomposition.PCA class.

PCA has some serious limitations:

Assumption of linear relationship: PCA only captures linear correlations between features.
Sensitivity to scaling: Features with larger ranges dominate the principal components.
Sensitivity to noise: Noise and outliers can distort the features set
Sensitivity to missing data: It requires some level of amputation
Assumption of Gaussian distribution
Unsupervised learning: It ignores class labels making unsuitable to classification
Assumption of feature importance: All features have the same importance or weights.

PCA for manifolds

The representation of points on the manifold as tangent vectors at a reference point to fit any machine learning algorithm. The principal components analysis is no exception.

The tangent space is a linear approximation of the manifold at a specific point (usually the Fréchet centroid of the data on the manifold [ref 2] ).

By projecting data onto the tangent space, PCA can be applied as if the data were in Euclidean space.

Fig. 2 Visualization of principal components in locally Euclidean tangent plane

PCA on the tangent space provides a way to represent manifold-valued data in a lower-dimensional, Euclidean space for tasks such as visualization, clustering, or regression. It enables the application of statistical analysis tools and tests requiring a Euclidean space.

This approach has numerous benefits

It allows handling of non-Euclidean data
It linearizes the manifold locally, reducing the computation complexity entailed by a manifold
The covariance is intrinsic to the manifold.

Implementation

Components

The goal is to develop a representation and programming framework capable of handling both linear PCA (Euclidean) and locally linear PCA (Manifold).

Let's begin by defining a class, PrincipalComponents, to represent components for both Euclidean spaces and manifolds.
The class attributes include:

base_point: A point on the manifold corresponding to the tangent space, locally Euclidean, where the principal components are calculated.
Components: A NumPy array representing the principal components.

@dataclass
class PrincipalComponents:
    base_point: np.array
    components: np.array

Next, we define a class, ManifoldPCA, which supports Principal Component Analysis (PCA) for manifolds embedded in a 3D ambient Euclidean space.

The class includes two attributes:

space: Represents the space type, which can either inherit from `Manifold` (with a locally Euclidean tangent space at a base point) or be `Euclidean` (globally Euclidean, where the base point can be any point in 3D space).
tgt_pca: Specifies the PCA computation class, which can be either Geomstats tangentPCA or Scikit-learn's PCA class.

The three main methods are:

estimate: Computes the principal components for a set of points on the manifold.
project: Projects points on the manifold along the principal components (eigenvector transformation).
geodesics: Visualizes the curves on the hypersphere associated with the principal components.

class ManifoldPCA(object):
    def __init__(self, space: Manifold) -> None:
        from geomstats.learning.pca import TangentPCA

        from geomstats.geometry.euclidean import Euclidean
        from sklearn.decomposition import PCA


        self.tgt_pca = TangentPCA(space, n_components=2) \

                               if not is instance(space, Euclidean) \

                              else PCA(n_components=3)
        self.space = space

    def estimate(self,

                         manifold_pts: np.array,

                         base_point: Optional[np.array] = None) -> PrincipalComponents:

    def project(self, manifold_pts: np.array) -> np.array:

    def geodesics(self,

                            principal_components: PrincipalComponents) -> List[np.array]:

Computation

The estimate method delegates the computation of principal components to either the tangent space (__tangent_pca) or the 3D Euclidean space (__euclidean_pca), depending on the context. It uses the Geomstats TangentPCA.fit [ref 4] method for manifolds or the Scikit-learn PCA.fit [ref 5] method for Euclidean spaces.

In this article we use the Hypersphere [ref 6] as our test manifold.

For PCA in the tangent space, the Fréchet centroid is used as the base point if none is provided. In the 3D Euclidean space, the arithmetic mean serves as the base point.

As expected, PCA in the Euclidean space produces three-dimensional principal components, whereas PCA in the locally Euclidean tangent space results in two-dimensional principal components.

def estimate(self,
                     manifold_pts: np.array,
                     base_point: Optional[np.array] = None) -> PrincipalComponents:
    from geomstats.geometry.euclidean import Euclidean

    return self .__tangent_pca(manifold_pts, base_point) \

              if not isinstance(self.space, Euclidean)  \  \
              else self.__euclidean_pca(manifold_pts, base_point)

"""  Compute the principal components on the tangent of the manifold """

def __tangent_pca(self,
                          manifold_pts: np.array,
                          base_point: Optional[np.array] = None) -> PrincipalComponents:
        from geomstats.learning.frechet_mean import FrechetMean

        # If no base points have been defined ... then use the centroid
        if base_point is None:
            frechet_mean = FrechetMean(self.space)
            frechet_mean.fit(X=manifold_pts, y=None)
            base_point = frechet_mean.estimate_


        # Compute the 2-dimension principal components in intrinsic coordinates
        self.pca.fit(manifold_pts, base_point=base_point)
        return PrincipalComponents(base_point, self.pca.components_)

"""  Compute the principal components on the Euclidean space """


def __euclidean_pca(self,
                            manifold_pts: np.array,
                            base_point: Optional[np.array]=None) -> PrincipalComponents:

        self.pca.fit(manifold_pts)
        mean = np.mean(manifold_pts, axis=0) if base_point is None else base_point

        return PrincipalComponents(mean, self.pca.components_)

The principal components are directions and constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. The method project redefine the values of the data points in relation of the new orthonormal basis.

def project(self, manifold_pts: np.array) -> np.array:
    return self.tgt_pca.transform(manifold_pts)

The geodesics corresponding to each orthonormal basis are generated for visualization purposes.

def geodesics(self, principal_components PrincipalComponents) -> List[np.array]:
     import geomstats.backend as gs

     # For now, we generate geodesics only for spheres
     if not isinstance(self.space, Hypersphere):
          raise GeometricException(

            f'Cannot extract geodesics from this type of manifold {self.space}'

          )

     # Extract the geodesic for each of the principal components
     base_point = principal_components.base_point
     geodesics = [self.space.metric.geodesic(initial_point=base_point,

                                                                      initial_tangent_vec=component)
                          for component in principal_components.omponents]
        
     # Generate a trace of 200 data points for visualization purpose
     trace = gs.linspace(-1.0, 1.0, 200)
     geodesic_traces = [geodesic(trace) for geodesic in geodesics]


     return geodesic_traces

Evaluation

The goal of this evaluation is to compute and visualize the principal components for a set of data points on a hypersphere [ref ], both:

On the tangent plane at the Fréchet mean using Geomstats.
In the surrounding 3D Euclidean space using Scikit-learn.

The steps are explained in the comments within the following code snippet.

# 1. Generate random points on the hypersphere
sphere = Hypersphere(dim=2)
X = sphere.random_uniform(n_samples=8)

# 2. Perform principal components analysis on tangent plane of the hypersphere
manifold_pca = ManifoldPCA(sphere)
principal_components = manifold_pca.estimate(X)
print(f'\nPrincipal components for Hypersphere:\n'

        f'{principal_components.components}')

# 3. Project the feature set on the 2D tangent plane
projected_points = manifold_pca.project(X)
print(f'\nProjected points on tangent space:\n{projected_points}')

# 4. Create and visualize the geodesics associated to the principal components
geodesic_components = manifold_pca.geodesics(principal_components)

# 5. Visualization 
hypersphere_plot = HyperspherePlot(X, principal_components.base_point)
hypersphere_plot.show(geodesic_components)

# 6. Apply linear PCA to the data points on hypershoer
euclidean_pca = ManifoldPCA(Euclidean(3))
principal_components = euclidean_pca.estimate(X)
print(f'\nPrincipal components in Euclidean space')
print(principal_components.components)


# 7. Project the features set on the Euclidean space
projected_points = euclidean_pca.project(X)
print(f'\nProjected points in Euclidean space:\n{projected_points}')

PCA for hypersphere data on tangent space

Output:

Principal components for Hypersphere:

[[ 0.37805 -0.21186 -0.90121]
[ 0.74583 0.64640 0.16091]]

Projected points on tangent space:

[[ 0.28557 -0.58902]

[-0.98468 0.58031]

[-0.69088 0.13017]

[-1.16979 -0.79723]

[-0.49557 -0.04852]

[-0.08720 0.82096]

[ 1.64977 0.56592]

[ 1.49278 -0.66260]]

The following 3D plot shows the 8 random points on the hypersphere, along with the two geodesics corresponding to the first and second principal components on the tangent plane, as well as the Fréchet centroid.

Fig. 3 Visualization of random points on a 3D sphere, along with

geodesics PCA and Fréchet mean

The manifold points are projected on the tangent plane then transformed along the two orthonormal basis.

Fig. 4 Visualization of hypersphere data point projected along the 2 orthonormal bases

on the tangent plane

PCA for hypersphere data on Euclidean space

Principal components in Euclidean space:

[[-0.07085 0.00221 0.99742 ]

[ 0.79399 0.60542 0.05505]

[-0.60377 0.79589 -0.04465]]

Projected points in Euclidean space:

[[-0.25303 -0.50401 -0.49961]

[ 0.68800 0.36335 0.29347]

[ 0.64130 0.04860 -0.10587]

[ 0.51994 -0.65668 0.48229]

[ 0.51764 -0.09482 -0.28160]

[ 0.15862 0.71983 -0.18580]

[-1.12479 0.42143 0.22098]

[-1.14769 -0.29770 0.07614 ]]

This final 3D plot illustrates the original 8 data points on the hypersphere, transformed along the three orthonormal bases in Euclidean space.

Fig. 5 Visualization of hypersphere data point projected along the 3 orthonormal bases

in the Euclidean space

Dimension reduction alternatives

There are several alternative algorithms dedicated to non-linear dimension reduction among them:

Kernel Principal Component Analysis (Kernel PCA) extends Principal Component Analysis (PCA) to handle data that is not linearly separable. By applying a kernel function to map the data into a higher-dimensional space, Kernel PCA reveals complex underlying structures in the data [ref 7].

t-SNE is an unsupervised, non-linear technique designed for exploring and visualizing high-dimensional data. It provides insights into how data is structured in higher dimensions by projecting it into two or three dimensions [ref 8].

Uniform Manifold Approximation and Projection (UMAP): leverages research done for Laplacian eigenmaps to tackle the problem of uniform data distributions on manifolds by combining Riemannian geometry with advances in fuzzy simplicial sets [ref 9].

Locally Linear Embedding (LLE) aims to create a lower-dimensional representation of data while preserving the distances within local neighborhoods. It can be viewed as performing a series of local Principal Component Analyses, which are then combined globally to determine the optimal non-linear embedding [ref 10].

Principal Geodesic Analysis

Target audience: Advanced

Estimated reading time: 7'

Posts history
Newsletter: Geometric Learning in Python

Principal Component Analysis (PCA) is essential for dimensionality and noise reduction, feature extraction, and anomaly detection. However, its effectiveness is limited by the strong assumption of linearity.

Principal Geodesic Analysis (PGA) addresses this limitation by extending PCA to handle non-linear data that lies on a lower-dimensional manifold.

Introduction

Principal components

Principal Component Analysis

Differential geometry

Tangent PCA

Implementation

Setup

Euclidean PCA

Tangent PCA on hypersphere

References

Appendix

What you will learn: How to implement principal geodesic analysis as extension of principal components analysis on a manifolds tangent space.

The complete article, featuring design principles, detailed implementation, in-depth analysis, and exercises, is available in Hands-on Principal Geodesic Analysis Hands-on Principal Geodesic Analysis

Notes:

Environments: Python 3.10.10, Geomstats 2.7.0, Scikit-learn 1.4.2
This article assumes that the reader is somewhat familiar with differential and tensor calculus [ref 1]. Please refer to our previous articles related to geometric learning [ref 2, 3, 4].
Source code is available at Github.com/patnicolas/Data_Exploration/manifolds
To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Introduction

This article is the ninth installments of our ongoing series focused on geometric learning. As with previous articles, we utilize the Geomstats Python library [ref. 5] to implement concepts associated with geometric learning.

Note: Summaries of my earlier articles on this topic can be found in the Appendix

As a reminder, the primary goal of learning Riemannian geometry is to understand and analyze the properties of curved spaces that cannot be described adequately using Euclidean geometry alone.

This article revisits the widely used unsupervised learning technique, Principal Component Analysis (PCA), and its counterpart in non-Euclidean space, Principal Geodesic Analysis (PGA).

The content of this article is as follows:

Brief recap of PCA
Overview of key components of differential geometry
Introduction to PCA on tangent space using the logarithmic map
Implementation in Python using the Geomstats library

Principal components

Principal component analysis

Principal Component Analysis (PCA) is a technique for reducing the dimensionality of a large dataset, simplifying it into a smaller set of components while preserving important patterns and trends. The goal is to reduce the number of variables of a data set, while preserving as much information as possible.

For a n-dimensional data, PCA tries to put maximum possible information in the first component c0, then maximum remaining information in the second c1 and so on.

Finally, we select the p << n top first components such as:\[\left \{ c_{i} :\ \ \sum_{i=1}^{p} c_{i} < T \ and \ \ c_{1} \geqslant c_{2} .... \geqslant c_{p} \right \}\]

The principal components are actually the eigenvectors of the Covariance matrix.

The first thing to understand about eigenvectors and eigenvalues is that they always appear in pairs, with each eigenvector corresponding to an eigenvalue. Additionally, the number of these pairs matches the number of dimensions in the data.

For instance, in a 3-dimension space, the eigenvalues are extracted from the following 3x3 symmetric Covariance matrix:\[\begin{bmatrix} cov(x, x)) & cov(x,y) & cov(x, z)\\ cov(x, y) & cov(y,y) & cov(y,z) \\ cov(x, z) & cov(y, z) & cov(z, z) \end{bmatrix}\] as cov(a, b) = cov(b, a).

The eigenvectors of the Covariance matrix (principal components) are the directions of the axes where there is the most variance (most information).

Assuming n normalized data points xi, the first principal component (with most significant eigenvalue) is defined as: \[P_{1}= arg\max_{||p||=1} \sum_{i=1}^{n} \left ( p.x_{i} \right )^{2} \ \ \ (1)\]where '.' is the dot product.

Differential geometry

Extending principal components to differentiable manifolds requires basic knowledge of differential and Riemann geometry introduced in previous articles [ref 2, 3, 4].

Smooth manifold

A smooth (or differentiable) manifold is a topological manifold equipped with a globally defined differential structure. Locally, any topological manifold can be endowed with a differential structure by applying the homeomorphisms from its atlas and the standard differential structure of a vector space.

Tangent space

At every point P on a differentiable manifold, one can associate a tangent space, which is a real vector space that intuitively encompasses all the possible directions in which one can move tangentially through P. The elements within this tangent space at P are referred to as the tangent vectors, tgt_vector at P.

This concept generalizes the idea of a vector originating from a specific point in Euclidean space. For a connected manifold, the dimension of the tangent space at any point is identical to the dimension of the manifold itself.

Geodesics

Geodesics are curves on a surface that only turn as necessary to remain on the surface, without deviating sideways. They generalize the concept of a "straight line" from a plane to a surface, representing the shortest path between two points on that surface.

Mathematically, a curve c(t) on a surface S is a geodesic if at each point c(t), the acceleration is zero or parallel to the normal vector:\[\frac{d^{2}}{dt^{2}}c(t) = 0 \ \ or \ \ \frac{d^{2}}{dt^{2}}c(t).\vec{n}=\frac{d^{2}}{dt^{2}}c(t)\]

Fig. 1 Illustration of a tangent vector and geodesic on a sphere

Logarithmic map

Given two points P and Q on a manifold, the vector on the tangent space v from P to Q is defined as: \[\left \| log_{P}(Q) \right \| = \left \| v \right \|\]

Tangent PCA

On a manifold, tangent spaces (or plane) are local euclidean space for which PCA can be computed. The purpose of Principal Geodesic Analysis is to project the principal components on the geodesic using the logarithmic map (inverse exponential map).

Given a mean m of n data points x[i] on a manifold with a set of geodesics at m, geod, the first principal component on geodesics is defined as:\[\begin{matrix} P_{1}=arg\max_{\left \| v \right \| = 1}\sum_{i=1}^{n}\left \langle v.log_{m}\pi _{geod}(x_{i}) \right \rangle^{2} \\ \pi_{geod}(x_{i}) = arg\max_{g \in geod} \left \| log_{m}(x_{i})-log_{m}(g) \right \|^2 \end{matrix}\]

Fig. 2 Illustration of a tangent vector and geodesic on a sphere

The mean point on the data point x[I] on the manifold can be defined as either a barycenter or a Frechet Mean [ref 6]. Our implementation in the next section relies on the Frechet mean.

Implementation

For the sake of simplicity, we illustrate the concept of applying PCA on manifold geodesics using a simple manifold, Hypersphere we introduced in a previous article, Differentiable Manifolds for Geometric Learning: Hypersphere

Setup

Let's encapsulate the evaluation of principal geodesics analysis in a class HyperspherePCA. We leverage the class HyperphereSpace [ref 7] its implementation of random generation of data points on the sphere.

ff                                 # --- Code Snippet 1 -----

    from manifolds.hyperspherespace import HypersphereSpace

I   import numpy as np

    class HyperspherePCA(object):
  def __init__(self):
     self.hypersphere_space = HypersphereSpace(equip=True)

  def sample(self, num_samples: int) -> np.array:
     return self.hypersphere_space.sample(num_samples)

Euclidean PCA

First let's implement the traditional PCA algorithm for 3 dimension (features) data set using scikit-learn library with two methods:

euclidean_pca_components to extract the 3 eigenvectors along the 3 eigenvalues
euclidean_pca_transform to project the evaluation data onto 3 directions defined by the eigenvectors.

                                                  # --- Code Snippet 2 -----

     from sklearn.decomposition import PCA

    @staticmethod

    def euclidean_pca_components(data: np.array) -> (np.array, np.array):
   num_components = 3
   pca = PCA(num_components)
   pca.fit(data)
   return (pca.singular_values_, pca.components_)

    @staticmethod

    def euclidean_pca_transform(data: np.array) -> np.array:
   num_components = 3
   pca = PCA(num_components)

        return pca.fit_transform(data)

We compute the principal components on 256 random 3D data points on the hypersphere.

nu                                                  # --- Code Snippet 3 -----

     num_samples = 256
pca_hypersphere = HyperspherePCA()

# Random data on Hypersphere
data = pca_hypersphere.sample(num_samples)
eigenvalues, components = HyperspherePCA.

                                                     euclidean_pca_components(data)
transformed = pca_hypersphere.euclidean_pca_transform(data)        

print(f'\nPrincipal components:\n{components}')
print(f'Eigen values:{eigenvalues}\nTransformed data:\n{transformed}')

Output

Principal components:

[[ 0.63124842 -0.7752775 -0.02168467]

[ 0.68247094 0.54196625 0.49041411]

[ 0.36845467 0.32437229 -0.8712197 ]]

Eigen values: [9.84728751 9.0183337 8.65835924]

Important note: The 3 eigenvalues are similar because the input data is random.

Tangent PCA on hypersphere

The private method __tangent_pca computes principal components on the tangent plane using the logarithmic map. As detailed in the previous section, this implementation employs the Frechet mean as the base point on the manifold, which is the argument of the Geomstats method fit.

The method tangent_pca_components extracts the principal components computed using the logarithmic map. Finally the method tangent_pca_transform projects input data along the 3 principal components, similarly to its Euclidean counterpart, euclidean_pca_transform .

                                                    # --- Code Snippet 4 -----

      from geomstats.learning.pca import TangentPCA
from geomstats.learning.frechet_mean import FrechetMean


def tangent_pca_components(self, data: np.array) -> np.array:
    tgt_pca = self.__tangent_pca(data)
    return tgt_pca.components_

def tangent_pca_transform(self, data: np.array) -> np.array:
    tgt_pca = self.__tangent_pca(data)
    tgt_pca.transform(data)

    
def __tangent_pca(self, data: np.array) -> TangentPCA:
     sphere = self.hypersphere_space.space
     tgt_pca = TangentPCA(sphere)

          # We use the Frechet mean as center of data points

          # on the hypersphere
     mean = FrechetMean(sphere, method="default")
     mean.fit(data)

          # Frechet mean estimate
     estimate = mean.estimate_

          # Invoke Geomstats fitting method for PCA on tangent plane

          tgt_pca.fit(data, base_point=estimate)
     return tgt_pca

We evaluate the principal geodesic components on hypersphere using the same 256 points randomly generated on the hypersphere and compared with the components on the Euclidean space.

nu                                                      # --- Code Snippet 5 -----

     num_samples = 256

     pca_hypersphere = HyperspherePCA()
        
# Random data point generated on Hypersphere
data = pca_hypersphere.sample(num_samples)
        
_, components = HyperspherePCA.

                                              euclidean_pca_components(data)
tangent_components = pca_hypersphere.

                                              tangent_pca_components(data)
print(f'\nEuclidean PCA components:\n{components}'

     print(f'Tangent Space PCA components:\n{tangent_components}')

Output:

Euclidean PCA components:

[[ 0.63124842 -0.7752775 -0.02168467]

[ 0.68247094 0.54196625 0.49041411]

[ 0.36845467 0.32437229 -0.8712197 ]]

Tangent Space PCA components:

[[ 0.8202982 -0.45814582 0.34236423]

[ 0.45806882 0.16783651 -0.87292832]

[-0.34246724 -0.87288792 -0.34753831]]

Sunday, December 15, 2024

Approximating PCA on Manifolds

Introduction

Principal Components Analysis (PCA)

PCA for manifolds

Implementation

Components

Computation

Evaluation

Dimension reduction alternatives

Wednesday, June 12, 2024

Principal Geodesic Analysis

Introduction

Principal components

Principal component analysis

Differential geometry

Tangent PCA

Implementation

Setup

Euclidean PCA

Tangent PCA on hypersphere

Contact Form

Equation Editor

Popular Posts