Wednesday, July 24, 2024

Uniform Manifold Approximation and Projection

Target audience: Expert
Estimated reading time: 8'
NewsletterGeometric Learning in Python                                                                       

Are you frustrated with the linear assumptions of Principal Component Analysis (PCA) or losing the global structure and relationships in your data with t-SNE? 
Leveraging Riemannian geometry, Uniform Manifold Approximation and Projection (UMAP) might be the solution you're looking for.


Table of Contents
       t-SNE
       UMAP
       Evaluation code
       Output
       Evaluation
       Output
       Tuning
Follow me on LinkedIn

What you will learn: How to leverage to Uniform Manifold Approximation and Projection method as a reliable non-linear dimensionality reduction technique that preserve the global distribution over low dimension space (Manifold).

Notes

  • Environments: Python  3.11,  SciKit-learn 1.5.1,  Matplotlib 3.9.1, umap-learn 0.5.6
  • Source code is available at  Github.com/patnicolas/Data_Exploration/umap
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Introduction

Dimension reduction algorithms can be classified into two categories:

  1. Models that maintain the global pairwise distance structure among all data samples, such as Principal Component Analysis (PCA) and Multidimensional Scaling (MDS).
  2. Models that preserve local distances, including t-distributed Stochastic Neighbor Embedding (t-SNE) and Laplacian Eigenmaps.

Uniform Manifold Approximation and Projection (UMAP) leverages research done for Laplacian eigenmaps to tackle the problem of uniform data distributions on manifolds by combining Riemannian geometry with advances in fuzzy simplicial sets [ref 1].

t-SNE is considered the leading method for dimension reduction in visualization. However, compared to t-SNE, UMAP maintains more of the global structure and offers faster run time performance.


t-SNE

t-SNE is a widely used technique for reducing the dimensionality of data and enhancing its visualization. It's often compared to PCA (Principal Component Analysis), another popular method. This article examines the performance of UMAP and t-SNE using two well-known datasets, Iris and MNIST.

t-SNE is an unsupervised, non-linear technique designed for exploring and visualizing high-dimensional data [ref 2]. It provides insights into how data is structured in higher dimensions by projecting it into two or three dimensions. This capability helps reveal underlying patterns and relationships, making it a valuable tool for understanding complex datasets.

UMAP

UMAP, developed by McInnes and colleagues, is a non-linear dimensionality reduction technique that offers several benefits over t-SNE, including faster computation and better preservation of the global structure of the data.

Benefits
  • Ease of Use: Designed to be easy to use, with a simple API for fitting and transforming data.
  • Scalability: Can handle large datasets efficiently.
  • Flexibility: Works with a variety of data types and supports both supervised and unsupervised learning.
  • Integration: Integrates with other Python scientific computing libraries in Python, such as Numpy, PyTorch and Scikit-learn.

The algorithm behind UMAP involves:

  1. Constructing a high-dimensional graph representation of the data.
  2. Capturing global relationships within the data.
  3. Encoding these relationships using a simplicial set.
  4. Applying stochastic gradient descent (SGD) to optimize the data's representation in a lower dimension, akin to the process used in autoencoders.

This method allows UMAP to maintain both local and global data structures, making it effective for various data analysis and visualization tasks.

For the mathematically inclined, the simplicial set captures the local and global relationships between two points xi and xj as: \[sim(x_{i}, x_{j})=e^{ - \frac{||x_{i}-x_{j}|| ^{2}}{\sigma_{i} \sigma_{j}}}\]with sigma parameters as scaler.
The cost function for all the points x with their representation yi is \[L=\sum_{i=1}^{n}\sum_{j=1}^{n}\left ( sim(x_{i}, x_{j}) - sim(y_{i}, y_{j}) \right )^2\]

Python module is installed through the usual pip utility.
    pip install umap-learn

The reader is invited to consult the documentation [ref 3].

Datasets

Let's consider the two datasets used for this evaluation:
  • MNIST: The MNIST database consists of 60,000 training examples and 10,000 test examples of handwritten digits. This dataset was originally created by Yann LeCun [ref 4].
  • IRIS: This dataset includes 150 instances, divided into three classes of 50 instances each, with each class representing a different species of iris plant [ref 5].

class DataSrc(Enum):
    MNIST = 'mnist'
    IRIS = 'iris'

First, we implement a data set loader for these two data sets using sklearn datasets module.

from sklearn.datasets import load_digits, load_iris


class DatasetLoader(object):
   def __init__(self, dataset_src: DataSrc) -> None:
       match dataset_src:
            case DataSrc.MNIST:
               digits = load_digits()
               self.data = digits.data
               self.color = digits.target.astype(int)
             
             case DataSrc.IRIS:
               images = load_iris()
               self.data = images.data
               self.names = images.target_names
               self.color = images.target.astype(int)
        self.dataset_src = dataset_src



t-SNE

Evaluation code

Let's wrap the evaluation of t-SNE algorithm for IRIS and MNIST data in a class TSneEval. The constructor initializes the Scikit-learn class TSNE with the appropriate number of components (dimension 2 or 3). The class TSneEval inherits for sklearn data loaders by subclassing DatasetLoader.

class TSneEval(DatasetLoader):
    def __init__(self, dataset_src: DataSrc, n_components: int) -> None:
        super(TSneEval, self).__init__(dataset_src)

        # Instantiate the Sklearn t-SNE model
        self.t_sne = TSNE(n_components=n_components)

The evaluation method, __call__, utilizes the Matplotlib library to visualize data clusters. It generates plots to display the clusters for each digit in the MNIST dataset and for each flower category in the Iris dataset.

def __call__(self, cmap: AnyStr) -> NoReturn:
     import matplotlib.pyplot as plt

     embedding = self.umap.fit_transform(self.data)
     x = embedding[:, 0]
     y = embedding[:, 1]
     n_ticks = 10

     plt.scatter(x=x, y=y, c=self.color, cmap=cmap, s=4.0)
     plt.colorbar(boundaries=np.arange(n_ticks+1) - 0.5).set_ticks(np.arange(n_ticks))
     plt.title(f'UMAP {self.dataset_src} {self.umap.n_neighbors} neighbors, min_dist: {self.umap.min_dist}')
     plt.show()


Output

The simple previous code snippet produces the t-SNE plots for MNIST and IRIS data sets.

MNIST dataset
First let's look at the 10 data clusters (1 per digit) in a two and 3 dimension plot.

Fig. 1  2-dimension t-SNE for MNIST data set


Fig. 2  3-dimension t-SNE for MNIST data set


IRIS dataset
The data related to each of the 3 types or Iris flowers are represented in 2 and 3 dimension plots

Fig. 3  2-dimension t-SNE for IRIS data set


Fig. 4  3-dimension t-SNE for IRIS data set


UMAP

Evaluation

We follow the same procedure for UMAP visualization as we do for t-SNE. The constructor of the UMAPEval wrapper class initializes the UMAP model with the specified number of neighbors (n_neighbors) and minimum distance (min_dist) parameters, which are used to represent the data.

import umap

class UMAPEval(DatasetLoader):
    def __init__(self, dataset_src: DataSrc, n_neighbors: int, min_dist: float) -> None:
       super(UMAPEval, self).__init__(dataset_src)

        # Instantiate the UMAP model
        self.umap = umap.UMAP(
               random_state=42, 
             n_neighbors=n_neighbors, 
             min_dist=min_dist)

Similar to the t-SNE evaluation, the __call__ method uses the Matplotlib library to visualize data clusters.

def __call__(self, cmap: AnyStr) -> NoReturn:
    embedding = self.umap.fit_transform(self.data)

    plt.scatter(
         x =embedding[:, 0], 
         y =embedding[:, 1], 
         c =self.color, 
         cmap =cmap, 
         s=4.0)
        .......  // Similar code as t-SNE
    plt.show()


Output


MNIST dataset

Fig. 5  UMAP plot for MNIST data  with 8 neighbors and min distance 0.3


Fig. 6 UMAP plot for MNIST data with 8 neighbors and min distance 0.6


Fig. 7 UMAP plot for MNIST data with 4 neighbors and min distance 0.8



IRIS dataset

Fig. 8  UMAP plot for IRIS data with 24 neighbors and min distance 0.005


Fig. 9 UMAP plot for IRIS data with 24 neighbors and min distance 0.001


Fig. 10 UMAP plot for IRIS data with 40 neighbors and min distance 0.001


Tuning

Number of neighbors
The parameter that defines the number of neighbors in UMAP, n_neighbors, determines the balance between global and local distances in the data visualization. A smaller number of neighbors means the local neighborhood is defined by fewer data points, which is ideal for detecting small clusters and capturing fine details. On the other hand, a larger number of neighbors helps in capturing broader, global patterns, but it may oversimplify the local relationships, potentially missing finer details in the data.

min_dist (compactness in low dimension)
This parameter plays a crucial role in determining the appearance of the low-dimensional representation, specifically affecting the clustering and spacing of data points.
A small min_dist value allows UMAP to pack points closer together in the low-dimensional space. It emphasizes the preservation of local structure and can make clusters more distinct.
A larger min_dist value enforces a minimum separation between points in the low-dimensional space. UMAP visualizations with a large min_dist  may show less clustering and more continuous distributions of data points.
The choice of min_dist thus influences the trade-off between preserving local versus global structures.


References



------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3 
and Geometric Learning in Python Newsletter on LinkedIn.

Tuesday, July 16, 2024

Riemannian Metric for SPD Manifolds

Target audience: Intermediate
Estimated reading time: 7'

Choosing between Riemannian and Euclidean metrics for classifying signal or dense data can be challenging. This article offers engineers an easy method to select the most suitable metric using the pyRiemann library.


Table of contents
      Setup
      Implementation
      Scoring

What you will learn: How to determine the suitable metric for a given dataset to optimize the decision boundary and accuracy for any classifier.

Follow me on LinkedIn

Notes

  • Environments: Python  3.11, pyRiemann 0.6, mne 1.7.1, SciKit-learn 1.5.1,  Matplotlib 3.9.1
  • This article assumes that the reader is somewhat familiar with Riemannian geometry [ref 1, 2]. 
  • Source code is available at  Github.com/patnicolas/Data_Exploration/classifiers
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Introduction

Geometric learning tackles the challenges posed by scarce data, high-dimensional spaces, and the requirement for independent representations in the creation of advanced machine learning models. The fundamental aim of studying Riemannian geometry is to comprehend and scrutinize the characteristics of curved spaces, which are not sufficiently explained by Euclidean geometry alone. 

This article is the 11th installment in our ongoing series on geometric learning. Most articles in this series use the Geomstats differential geometry library to implement some of the most common machine learning algorithms on the hypersphere.
In this post, we use
pyRiemann [ref 3], a Python library dedicated to signal processing and time series analysis on Riemannian manifolds.

The two most common Riemann metric for SPD matrices are:
  • Affine-Invariant metric
  • Log-Euclidean metric
These two metrics have been described and evaluated in a previous post: [ref 4]

Symmetric positive definite (SPD) manifold

SPD matrices have been introduced in a previous article in Logistic Regression on Riemann Manifolds

A square matrix A is symmetric if it is identical to its transpose, meaning that if aaij are the entries of A, then aaij=ajiaij. This implies that A can be fully described by its upper triangular elements.
aij
A square matrix A is positive definite if, for every non-zero vector b, the product bTAb >= 0 [ref 5].

If a matrix A is both symmetric and positive definite, it is referred to as a symmetric positive definite (SPD) matrix. This type of matrix is extremely useful and appears in various real-world applications. A prominent example in statistics is the covariance matrix, where each entry represents the covariance between two variables (with diagonal entries indicating the variances of individual variables). Covariance matrices are always positive semi-definite (meaning bTAb0), and they are positive definite if the covariance matrix has full rank, which occurs when each row is linearly independent from the others.
The collection of all SPD matrices of size n×n forms a manifold.

pyRiemann library

pyRiemann is a Python machine learning package built on the scikit-learn API. It offers a high-level interface for classifying real or complex-valued multivariate data using the Riemannian geometry of symmetric positive definite (SPD) and Hermitian positive definite (HPD) matrices.

Its primary aim is to conduct multivariate data analysis on time series within these Riemannian manifolds. Our use case consists of comparing various known machine learning algorithms in Euclidean and Riemann space (SPD matrices). The data sets consists of signals such as Electroencephalograms (EEG) or Magnetic Resonance Images (MRI) used in brain-computer interface (BCI) and required mne library to be loaded.

Setup

The installation and configuration of pyRiemann is straight forward [ref 6].

To Install pyRiemann module: pip install pyriemann
To install mne: pip install mne
Source from Github: git clone https://github.com/pyRiemann/pyRiemann.git


Datasets

We are comparing two widely used machine learning algorithms, Support Vector Machine (SVM) and k-Nearest Neighbors (k-NN), for Symmetric Positive Definite matrices in Euclidean space (using Scikit-learn) and on a Riemannian manifold. The comparison will be conducted using two datasets:
  • Set 1: Data from Electroencephalograms (Brain-Computer Interface) [ref 7]
  • Set 2: Synthetic Gaussian distributed data

Let's encapsulate the data generation process within a class named SPDMatricesDataset. We consider a sample of 48 SPD matrices, with target values defined as binary {0, 1}. The create method generates the two datasets described in the previous section.

Note: Some methods and variables that are not essential for understanding the algorithms have been omitted.

class SPDMatricesDataset(object):

    def __init__(self) -> None:
        n_spd_matrices = 48
self.target = np.concatenate([ np.zeros(n_spd_matrices), np.ones(n_spd_matrices) ])


    # Generation of data sets used in comparing SVM and kNN over
    #  Euclidean space and Riemannian manifold

    def create(self) -> List[np.array]:
        evals_lows = 11
     class_sep_ratio = 1.0
spd_matrices = self.__make_spd_matrices(evals_lows) return [ (spd_matrices, self.target), # Set 1 self.__make_gaussian_blobs(class_sep_ratio), # Set 2 ]

The two data sets are visualized with scatter plots using matplotlib module.
Fig. 1 Visualization of SPD matrices data set - Set 1

Fig. 2 Visualization of gaussian data set - Set 2


Finally, the method train_test_data_split extracts the training and test data from the features and target data, and encapsulate them into the SPDTrainingData  data class (see Appendix).

@staticmethod
def train_test_data_split(features: np.array, target: np.array) -> SPDTrainingData:
    from sklearn.model_selection import train_test_split

    train_X, test_X, train_y, test_y = train_test_split(
        features,
        target,
        test_size=0.3,
        random_state=42
    )

    return SPDTrainingData(train_X, test_X, train_y, test_y)


Evaluation

The evaluation of the two metrics (Euclidean and Riemannian) for any given classifier involves two steps:
  • Calculate the classifier's score for both metrics.
  • Define the decision boundary for each of the two datasets.

Implementation

We encapsulate the evaluation of these metrics within a class named SPDMatricesClassifierTraining and scoring the classifier on the two datasets utilize the pyRiemann API [ref 8], which conveniently follows the method signatures of Scikit-learn's equivalent functions.

class SPDMatricesClassifier(object):
    def __init__(self,
                 classifier,
                 spd_metric: SPDMetric,
                 spd_training_data: SPDTrainingData) -> None:
        self.classifier = classifier                                # Target classifier (SVM,...)
        self.spd_metric = spd_metric                         # Metric (Euclidean, ...
        self.spd_training_data = spd_training_data   # Our training data

    # Train then score the given classifier using pyRiemann API
  
    def score(self) -> float:
         # 1. Select metric
        self.classifier.set_params(**{'metric': str(self.spd_metric.value)})

        # 2. Train model
        self.classifier.fit(self.spd_training_data.train_X, self.spd_training_data.train_y)

        # 3. Score model on the test data
return self.classifier.score( self.spd_training_data.test_X, self.spd_training_data.test_y)


    @staticmethod
    @partial(np.vectorize, excluded=['clf'])
    def get_probability(cov_x: np.array, cov_y: np.array, cov_z: np.array, clf):
        cov = np.array(
            [[cov_x, cov_y, 0.0, 0.0], 
             [cov_y, cov_z, 0.0, 0.0], 
             [0.0, 0.0, 0.0, 0.0], 
             [0.0, 0.0, 0.0, 0.0]]
        )
        u = cov[np.newaxis, ...]

        return clf.predict_proba(u)[0, 1]

Scoring

The score is computed on the test data and target from a sample of 48 SPD matrices

K-Nearest Neighbors (k=4)
Euclidean
             Set 1: 0.827
             Set 2: 0.655
Riemann - 
             Set 1: 0.896
             Set 2: 0.689
          

Support Vector Machine
Euclidean
             Set 1: 0.965
             Set 2: 0.697
Riemann
             Set 1: 1.000
             Set 2: 0.586

As expected, both k-Nearest Neighbors and Support Vector Machines achieve higher scores on the Riemannian manifold for SPD matrices.


Classifier decision boundary 

The objective is to evaluate the decision boundary between target values {0, 1} for classifying the two datasets using a support vector machine with both Riemannian and Euclidean metrics. The input parameters consist of 48 SPD matrices, a range [11, 15] for displaying decision boundary with a class separation of 1.0


Dataset 1 (SPD matrices)
Fig. 3 Decision Boundary Support Vector Machine - Euclidean space


Fig. 4 Decision Boundary Support Vector Machine - Riemann Manifold


Dataset 2 (Synthetic Gaussian)
Fig. 5 Decision Boundary Support Vector Machine - Euclidean space

Fig. 6 Decision Boundary Support Vector Machine - Riemann Manifold



References



------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3 
and Geometric Learning in Python Newsletter on LinkedIn.

Appendix

Training data

The SPDTrainingData is a convenient data class to wrap the training and testing data.

@dataclass
class SPDTrainingData:
    train_X: np.array
    test_X: np.array
    train_y: np.array
    test_y: np.array


Plotting

This method of class SPDMatricesClassifier trains, scores a selected classifier on data set defined in spd_training_data. It  creates scatter plots for train and test data and finally visualizes the various decision boundaries.

 def plot(self, title: Optional[AnyStr] = None) -> NoReturn:
        
        # Step 1: Train and compute the score for the classifier
        score: float = self.score()
        
        # Step 2: Create scatter plots
        ax = SPDMatricesDataset.create_scatter_plots(
            self.spd_training_data,
            fig=plt.figure(figsize=(16, 8))
        )

        # Step 3: Create contour for decision boundaries
        self.__create_contour(ax, self.spd_training_data.get_spd_dataset_limits())

        font_dict =  {
            'family': 'serif',
            'color':  'darkred',
            'weight': 'bold',
            'size': 20,
        }
        plt.title(title, fontdict=font_dict)

The implementation of methods SPDMatricesDataset.create_scatter_plots and SPDMatricesClassifier.__create_contour isavailable on GitHub [ref 9].
\