
Wednesday, July 24, 2024

Uniform Manifold Approximation and Projection

Target audience: Expert
Estimated reading time: 8'
Newsletter: Geometric Learning in Python

Are you frustrated with the linear assumptions of Principal Component Analysis (PCA) or losing the global structure and relationships in your data with t-SNE? 
Leveraging Riemannian geometry, Uniform Manifold Approximation and Projection (UMAP) might be the solution you're looking for.


Table of Contents
       t-SNE
       UMAP
       Evaluation code
       Output
       Evaluation
       Output
       Tuning

What you will learn: How to leverage the Uniform Manifold Approximation and Projection (UMAP) method as a reliable non-linear dimensionality reduction technique that preserves the global structure of the data in a low-dimensional space (manifold).

Notes

  • Environments: Python 3.11, Scikit-learn 1.5.1, Matplotlib 3.9.1, umap-learn 0.5.6
  • Source code is available at  Github.com/patnicolas/Data_Exploration/umap
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Introduction

Dimension reduction algorithms can be classified into two categories:

  1. Models that maintain the global pairwise distance structure among all data samples, such as Principal Component Analysis (PCA) and Multidimensional Scaling (MDS).
  2. Models that preserve local distances, including t-distributed Stochastic Neighbor Embedding (t-SNE) and Laplacian Eigenmaps.

Uniform Manifold Approximation and Projection (UMAP) leverages research done for Laplacian eigenmaps to tackle the problem of uniform data distributions on manifolds by combining Riemannian geometry with advances in fuzzy simplicial sets [ref 1].

t-SNE is considered the leading method for dimension reduction in visualization. However, compared to t-SNE, UMAP maintains more of the global structure and offers faster run time performance.


t-SNE

t-SNE is a widely used technique for reducing the dimensionality of data and enhancing its visualization. It's often compared to PCA (Principal Component Analysis), another popular method. This article examines the performance of UMAP and t-SNE using two well-known datasets, Iris and MNIST.

t-SNE is an unsupervised, non-linear technique designed for exploring and visualizing high-dimensional data [ref 2]. It provides insights into how data is structured in higher dimensions by projecting it into two or three dimensions. This capability helps reveal underlying patterns and relationships, making it a valuable tool for understanding complex datasets.
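For readers who want to try t-SNE directly, here is a minimal, self-contained sketch using Scikit-learn's TSNE on the Iris data, independent of the evaluation classes introduced later; the perplexity and random_state values are arbitrary illustrative choices.

from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# Project the 4 Iris features into 2 dimensions with t-SNE
iris = load_iris()
t_sne = TSNE(n_components=2, perplexity=30.0, random_state=42)
embedding = t_sne.fit_transform(iris.data)
print(embedding.shape)    # (150, 2)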

UMAP

UMAP, developed by McInnes and colleagues, is a non-linear dimensionality reduction technique that offers several benefits over t-SNE, including faster computation and better preservation of the global structure of the data.

Benefits
  • Ease of Use: Designed to be easy to use, with a simple API for fitting and transforming data.
  • Scalability: Can handle large datasets efficiently.
  • Flexibility: Works with a variety of data types and supports both supervised and unsupervised learning.
  • Integration: Integrates with other Python scientific computing libraries, such as NumPy, PyTorch and Scikit-learn.

The algorithm behind UMAP involves:

  1. Constructing a high-dimensional graph representation of the data.
  2. Capturing global relationships within the data.
  3. Encoding these relationships using a simplicial set.
  4. Applying stochastic gradient descent (SGD) to optimize the data's representation in a lower dimension, akin to the process used in autoencoders.

This method allows UMAP to maintain both local and global data structures, making it effective for various data analysis and visualization tasks.

For the mathematically inclined, the simplicial set captures the local and global relationships between two points \(x_{i}\) and \(x_{j}\) as: \[sim(x_{i}, x_{j})=e^{ - \frac{||x_{i}-x_{j}|| ^{2}}{\sigma_{i} \sigma_{j}}}\] where the sigma parameters act as scale factors.
The cost function over all the points \(x_{i}\) and their low-dimensional representations \(y_{i}\) is \[L=\sum_{i=1}^{n}\sum_{j=1}^{n}\left ( sim(x_{i}, x_{j}) - sim(y_{i}, y_{j}) \right )^2\]
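To make these two formulas concrete, here is a small, illustrative NumPy sketch that evaluates the pairwise similarity and the cost on random toy data. It is not the optimization actually performed by umap-learn, and the unit sigma values are an arbitrary simplification.

import numpy as np

# Pairwise similarity sim(x_i, x_j) = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j))
def similarity(x: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    sq_dist = np.square(x[:, None, :] - x[None, :, :]).sum(axis=-1)
    return np.exp(-sq_dist / np.outer(sigma, sigma))

# Squared-difference cost between high- and low-dimensional similarities
def cost(x: np.ndarray, y: np.ndarray, sigma_x: np.ndarray, sigma_y: np.ndarray) -> float:
    return float(np.square(similarity(x, sigma_x) - similarity(y, sigma_y)).sum())

x = np.random.rand(8, 4)      # 8 points in the original 4-dimensional space
y = np.random.rand(8, 2)      # their candidate 2-dimensional representations
sigma = np.ones(8)            # unit scale factors for simplicity
print(cost(x, y, sigma, sigma))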

The Python module is installed through the usual pip utility:
    pip install umap-learn

The reader is invited to consult the documentation [ref 3].

Datasets

Let's consider the two datasets used for this evaluation:
  • MNIST: The MNIST database consists of 60,000 training examples and 10,000 test examples of handwritten digits. This dataset was originally created by Yann LeCun [ref 4].
  • IRIS: This dataset includes 150 instances, divided into three classes of 50 instances each, with each class representing a different species of iris plant [ref 5].

from enum import Enum

class DataSrc(Enum):
    MNIST = 'mnist'
    IRIS = 'iris'

First, we implement a dataset loader for these two datasets using the sklearn.datasets module.

from sklearn.datasets import load_digits, load_iris


class DatasetLoader(object):
    def __init__(self, dataset_src: DataSrc) -> None:
        match dataset_src:
            case DataSrc.MNIST:
                digits = load_digits()
                self.data = digits.data
                self.color = digits.target.astype(int)

            case DataSrc.IRIS:
                images = load_iris()
                self.data = images.data
                self.names = images.target_names
                self.color = images.target.astype(int)
        self.dataset_src = dataset_src
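As a quick sanity check, the loader can be exercised as follows (hypothetical usage; note that the MNIST variant relies on Scikit-learn's load_digits, which contains 1,797 8x8 images rather than the full 60,000-image database):

# Illustrative usage of the loader defined above
iris_loader = DatasetLoader(DataSrc.IRIS)
print(iris_loader.data.shape)     # (150, 4)
print(iris_loader.names)          # ['setosa' 'versicolor' 'virginica']

digits_loader = DatasetLoader(DataSrc.MNIST)
print(digits_loader.data.shape)   # (1797, 64): Scikit-learn digits dataset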



t-SNE

Evaluation code

Let's wrap the evaluation of the t-SNE algorithm for the IRIS and MNIST data in a class, TSneEval. The constructor initializes the Scikit-learn class TSNE with the appropriate number of components (dimension 2 or 3). The class TSneEval inherits the dataset loading logic by subclassing DatasetLoader.

from sklearn.manifold import TSNE


class TSneEval(DatasetLoader):
    def __init__(self, dataset_src: DataSrc, n_components: int) -> None:
        super(TSneEval, self).__init__(dataset_src)

        # Instantiate the Sklearn t-SNE model
        self.t_sne = TSNE(n_components=n_components)

The evaluation method, __call__, utilizes the Matplotlib library to visualize data clusters. It generates plots to display the clusters for each digit in the MNIST dataset and for each flower category in the Iris dataset.

def __call__(self, cmap: AnyStr) -> NoReturn:
    import matplotlib.pyplot as plt
    import numpy as np

    # Project the data into the lower-dimensional space
    embedding = self.t_sne.fit_transform(self.data)
    x = embedding[:, 0]
    y = embedding[:, 1]
    n_ticks = 10

    plt.scatter(x=x, y=y, c=self.color, cmap=cmap, s=4.0)
    plt.colorbar(boundaries=np.arange(n_ticks + 1) - 0.5).set_ticks(np.arange(n_ticks))
    plt.title(f't-SNE {self.dataset_src} with {self.t_sne.n_components} components')
    plt.show()
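A possible invocation of the wrapper, with an arbitrary choice of color map:

# Illustrative invocation: 2-dimensional t-SNE for both datasets
TSneEval(dataset_src=DataSrc.MNIST, n_components=2)(cmap='Spectral')
TSneEval(dataset_src=DataSrc.IRIS, n_components=2)(cmap='Spectral')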


Output

The previous code snippet produces the t-SNE plots for the MNIST and IRIS datasets.

MNIST dataset
First, let's look at the 10 data clusters (one per digit) in two- and three-dimensional plots.

Fig. 1  2-dimension t-SNE for MNIST data set


Fig. 2  3-dimension t-SNE for MNIST data set


IRIS dataset
The data related to each of the three types of iris flowers are represented in 2- and 3-dimensional plots.

Fig. 3  2-dimension t-SNE for IRIS data set


Fig. 4  3-dimension t-SNE for IRIS data set


UMAP

Evaluation

We follow the same procedure for UMAP visualization as we do for t-SNE. The constructor of the UMAPEval wrapper class initializes the UMAP model with the specified number of neighbors (n_neighbors) and minimum distance (min_dist) parameters, which are used to represent the data.

import umap


class UMAPEval(DatasetLoader):
    def __init__(self, dataset_src: DataSrc, n_neighbors: int, min_dist: float) -> None:
        super(UMAPEval, self).__init__(dataset_src)

        # Instantiate the UMAP model with a fixed seed for reproducibility
        self.umap = umap.UMAP(
            random_state=42,
            n_neighbors=n_neighbors,
            min_dist=min_dist)

Similar to the t-SNE evaluation, the __call__ method uses the Matplotlib library to visualize data clusters.

def __call__(self, cmap: AnyStr) -> NoReturn:
    embedding = self.umap.fit_transform(self.data)

    plt.scatter(
        x=embedding[:, 0],
        y=embedding[:, 1],
        c=self.color,
        cmap=cmap,
        s=4.0)
    # Color bar: similar code as for t-SNE
    plt.title(f'UMAP {self.dataset_src} {self.umap.n_neighbors} neighbors, min_dist: {self.umap.min_dist}')
    plt.show()
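The plots in the next section can be reproduced with invocations along these lines, using the n_neighbors and min_dist values from the figure captions (the color map is again an arbitrary choice):

# Illustrative invocations matching some of the figure configurations
UMAPEval(dataset_src=DataSrc.MNIST, n_neighbors=8, min_dist=0.3)(cmap='Spectral')
UMAPEval(dataset_src=DataSrc.IRIS, n_neighbors=24, min_dist=0.005)(cmap='Spectral')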


Output


MNIST dataset

Fig. 5  UMAP plot for MNIST data  with 8 neighbors and min distance 0.3


Fig. 6 UMAP plot for MNIST data with 8 neighbors and min distance 0.6


Fig. 7 UMAP plot for MNIST data with 4 neighbors and min distance 0.8



IRIS dataset

Fig. 8  UMAP plot for IRIS data with 24 neighbors and min distance 0.005


Fig. 9 UMAP plot for IRIS data with 24 neighbors and min distance 0.001


Fig. 10 UMAP plot for IRIS data with 40 neighbors and min distance 0.001


Tuning

Number of neighbors
The parameter that defines the number of neighbors in UMAP, n_neighbors, determines the balance between global and local distances in the data visualization. A smaller number of neighbors means the local neighborhood is defined by fewer data points, which is ideal for detecting small clusters and capturing fine details. On the other hand, a larger number of neighbors helps in capturing broader, global patterns, but it may oversimplify the local relationships, potentially missing finer details in the data.

min_dist (compactness in low dimension)
This parameter plays a crucial role in determining the appearance of the low-dimensional representation, specifically affecting the clustering and spacing of data points.
A small min_dist value allows UMAP to pack points closer together in the low-dimensional space. It emphasizes the preservation of local structure and can make clusters more distinct.
A larger min_dist value enforces a minimum separation between points in the low-dimensional space. UMAP visualizations with a large min_dist  may show less clustering and more continuous distributions of data points.
The choice of min_dist thus influences the trade-off between preserving local versus global structures.
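One simple way to explore this trade-off is to sweep both parameters with the UMAPEval wrapper defined earlier; the grid values below are illustrative, not recommendations.

# Hypothetical parameter sweep to visualize the local/global trade-off
for n_neighbors in (4, 8, 16, 32):
    for min_dist in (0.001, 0.01, 0.1, 0.5):
        UMAPEval(dataset_src=DataSrc.MNIST, n_neighbors=n_neighbors, min_dist=min_dist)(cmap='Spectral')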


References



------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3 
and Geometric Learning in Python Newsletter on LinkedIn.

Wednesday, October 10, 2018

K-Means Clustering in Java: Components

Target audience: Advanced
Estimated reading time: 5'

K-means clustering stands as a prominent unsupervised learning technique, aiming to classify unlabelled data into distinct categories. Its primary objective is to identify inherent groupings within the dataset. To achieve this, the algorithm operates in cycles, designating each data entry to a specific group based on its defining features.

This introductory piece in our series delves into the implementation of K-means' fundamental components.



Overview

Many clustering methods have been developed over the years, from spectral clustering, non-negative matrix factorization and canopy to hierarchical and K-means clustering. The K-means algorithm is by far the easiest to implement. This simplicity comes at a high price in terms of scalability and even reliability. However, as an unsupervised learning technique, K-means is still valuable for reducing the number of model features or detecting anomalies.

The objective is to classify observations or data points by groups that share common attributes or features. The diagram below illustrates the clustering of observations (x,y) for a simple 2-feature model.



Each cluster has a mean or centroid, \(m = (m_{0},\,...,\,m_{n})\). First we need to define a distance between an observation \(x = (x_{0},\,...,\,x_{n})\) and the centroid m. The Manhattan and Euclidean distances are respectively defined as: \[d_{M} = \sum_{i=0}^{n}\left | x_{i} - m_{i}\right |\,\,\,,\,\,d_{E}= \sqrt{\sum_{i=0}^{n} (x_{i} - m_{i})^{2}}\] The loss function over the N clusters is defined by \[W(C)=\frac{1}{2}\sum_{k=0}^{N-1}\sum_{c_{i}=k}\sum_{c_{j}=k} d(x_{i},x_{j})\] The goal is to find the centroids m and clusters C that minimize this loss, which amounts to assigning each observation to the cluster with the closest centroid: \[C^{*}\left (i \right ) = arg\min _{k\in [0,N-1]}d (x_{i}, m_{k})\]

Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception, validation of class and method arguments, scoping qualifiers or import is omitted.


Distances and observations

First we need to define the distance between each observation and the centroid of a cluster. The class hierarchy related to the distance can be implemented as nested classes, as there is no reason to expose it to client code. The interface, Distance, defines the signature of the computation method. For the sake of simplicity, the sample code implements only the Manhattan and Euclidean distances. Exceptions, validation of method arguments, and setter and getter methods are omitted.

protected interface Distance {
    public double compute(double[] x, Centroid centroid);
}

    // Definition of the Manhattan distance d(x,m) = sum |x - m|
protected class ManhattanDistance implements Distance {
   
   public double compute(double[] x, Centroid centroid) {
       double sum = 0.0, xx = 0.0;
       for( int k = 0; k< x.length; k++) {
           xx = x[k] - centroid.get(k);
           if( xx < 0.0) {
              xx = -xx;
           }
           sum += xx;
       }
       return sum;
    }
}

  // Definition of the Euclidean distance d(x,m) = sqrt(sum (x - m)^2)
protected class EuclideanDistance implements Distance {
  
    public double compute(double[] x, Centroid centroid) {
        double sum = 0.0, xx = 0.0;
        for( int k = 0; k < x.length; k++) {
            xx = x[k] - centroid.get(k);
            sum += xx*xx;
        } 
        return Math.sqrt(sum);
    } 
}

Next, we define an observation (or data point) as a vector or array of floating point values. An observation can support heterogeneous types (boolean, integer, floating point, ...) as long as they are normalized to [0, 1]. In our example we simply normalize over the maximum values across all the observations.

public final class Observation {

     // Euclidean distance shared between all the instances by default
    private static Distance metric = new EuclideanDistance();

    public static void setMetric(final Distance newMetric) {
        metric = newMetric;
    }
 
    private double[] _x  = null;
    private int  _index  = -1;

    public Observation(double[] x, int index) { 
        _x = x; 
        _index = index; 
    }

     // Access the value of the k-th feature
    public double get(int k) {
        return _x[k];
    }
    
     // Compute the distance between this point and the centroid
    public double computeDistance(final Centroid centroid) {
        return metric.compute(_x, centroid);
    }

     // Normalize the data point over the maximum values of each feature
    public void normalize(double[] maxValues) {
        for( int k = 0; k < _x.length; k++) {
            _x[k] /= maxValues[k];
        }
    }
}


Clustering

The centroid of each cluster is computed iteratively to reduce the loss function. The centroid values are computed as the mean of each feature across all the observations in the cluster. The method Centroid.compute initializes the data points belonging to a cluster with the list of observations and computes the centroid values _x by normalizing with the number of points.

protected class Centroid {
    private double[] _x = null;

    protected Centroid() {}
    protected Centroid(double[] x) {
        _x = Arrays.copyOf(x, x.length);
    }

     // Access the value of the k-th centroid coordinate
    protected double get(int k) {
        return _x[k];
    }

     // Compute the centroid values _x by normalizing with the number of points.
    protected void compute(final List<Observation> observations)  {
        double[] x = new double[_x.length];
        Arrays.fill(x, 0.0);

        for( Observation point : observations ) {
            for(int k = 0; k < x.length; k++) {
                x[k] += point.get(k);
            }
        }

        int numPoints = observations.size();
        for(int k = 0; k < x.length; k++) {
            _x[k] = x[k]/numPoints;
        }
    }
}

A cluster, KmeansCluster, is defined by its label (_index in this example), a centroid, _centroid, the list of observations, _observations, it contains, and the current loss associated with the cluster (the sum of the distances between all observations and the centroid).
The cluster behavior is defined by the following methods:
  • computeCentroid: Recompute the centroid from the observations currently assigned to this cluster, then update the cumulative distance.
  • attach: Attach or add a new observation to this cluster.
  • detach: Remove an existing observation from this cluster.

public final class KmeansCluster {
    private int       _index   = -1;
    private Centroid  _centroid  = null; 
    private double    _sumDistances  = 0.0;
    private List<Observation> _observations = new ArrayList<Observation>();

    public KmeansCluster(int index) {
        _index = index;
    }

    public void computeCentroid() {
        _centroid.compute( _observations );
        for( Observation point : _observations  ) {
            point.computeDistance(_centroid);
        }
        computeSumDistances();
    }

      // Attach a new observation to this cluster.
    public void attach(final Observation point) { 
        point.computeDistance(_centroid);
        _observations.add(point);
        computeSumDistances();
    }

      // Remove an existing observation from this cluster.
    public void detach(final Observation point) {
        _observations.remove(point);
        computeSumDistances();
    }

      // Sum of the distances between all observations and the centroid.
    private void computeSumDistances() { 
        _sumDistances = 0.0;     
        for( Observation point : _observations) {
            _sumDistances += point.computeDistance(_centroid);
        }
    }
       //....
}

Finally, the clustering class implements the training and run-time classification. The train method iterates over all the clusters and all the observations to reassign each observation to the closest cluster. The iterative computation ends when either the loss value converges or the maximum number of iterations is reached.

If the algorithm uses K clusters with M observations of N variables, the cost of one assignment pass over the clusters is of order K*M*N. If the algorithm converges after T iterations, then the overall cost is T*K*M*N. For instance, the K-means classification of 20,000 observations with 25 dimensions, using 10 clusters and converging after 50 iterations, requires 250,000,000 distance evaluations! The constructor creates the clustering algorithm with a predefined number of clusters, K, and a set of observations.
The method getCentroids retrieves the current list of centroids (the values of the centroid vectors).
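As a quick check of the arithmetic for this example: \[T \times K \times M \times N = 50 \times 10 \times 20{,}000 \times 25 = 250{,}000{,}000\]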

public final class KmeansClustering { 
    private KmeansCluster[] _clusters = null;
    private Observation[] _obsList = null; 
    private double _totalDistance  = 0.0;
    private Centroid[] _centroids = null;

      // Create the clustering algorithm with a predefined number of clusters
    public KmeansClustering(int numClusters, final Observation[] obsList) {   
        _clusters = new KmeansCluster[numClusters];
        for (int i = 0; i < numClusters; i++) {
            _clusters[i] = new KmeansCluster(i);
        }
        _obsList = obsList;
    }

      // Retrieve the current list of centroid vectors
    public final List<double[]> getCentroids() {
        List<double[]> centroidDataList = null;

        if( _clusters != null && _clusters.length > 0) {
            centroidDataList = new LinkedList<double[]>();
            for( KmeansCluster cluster : _clusters) {
                centroidDataList.add(cluster.getCentroid().getX());
            }
        }
        return centroidDataList;
    }
}

The next article, K-means Clustering in Java: Classification, describes the implementation of the training and classification tasks.

Thank you for reading this article. For more information ...

References

  • The Elements of Statistical Learning - T. Hastie, R. Tibshirani, J. Friedman - Springer 2001
  • Machine Learning: A Probabilistic Perspective, 11.4.2.5 K-means algorithm - K. Murphy - MIT Press 2012
  • Pattern Recognition and Machine Learning, Chap 9 "Mixture Models and EM: K-means Clustering" - C. Bishop - Springer Science 2006
  • github.com/patnicolas


---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3