Hierarchical Clustering From Scratch
In this article I will walk you through an implementation of hierarchical clustering from scratch. The code can be found HERE. Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. It can be performed using two main approaches: bottom-up (agglomerative) and top-down (divisive). Here’s a comparison and explanation of both methods.
Bottom-Up (agglomerative) Hierarchical Clustering
Clustering steps:
Initialization: Start with each data point as its own cluster.
Iteration:
Compute the distances between all pairs of clusters.
Find the pair of clusters that are closest together.
Merge the closest pair into a single cluster.
Update the distance matrix to reflect the merger.
Termination: Repeat the iteration process until all points are merged into a single cluster or until the desired number of clusters is achieved.
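To make these steps concrete, here is a minimal sketch of the agglomerative loop using plain NumPy and single linkage. The function and variable names are illustrative, not the ones used in the implementation later in this post:

```python
import numpy as np

def agglomerative_sketch(points, n_clusters=1):
    """Naive agglomerative clustering: each cluster is a list of point indices."""
    clusters = [[i] for i in range(len(points))]          # initialization: one cluster per point
    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):                    # compute pairwise cluster distances
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])   # single linkage
                if d < best[2]:
                    best = (a, b, d)                      # remember the closest pair
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]           # merge the closest pair
        del clusters[b]                                   # update the cluster list
    return clusters
```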
Top-Down (divisive) Hierarchical Clustering
Clustering steps:
Initialization: Start with all data points in a single cluster.
Iteration:
Divide the cluster into two sub-clusters using a clustering algorithm (e.g., k-means).
Repeat the division process for each resulting cluster.
Termination: Continue the iteration process until each point is its own cluster or until the desired number of clusters is achieved.
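For completeness, here is a rough sketch of the divisive approach, which recursively bisects clusters with scikit-learn's KMeans. The splitting rule (always split the largest cluster) and the function name are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_sketch(points, max_clusters=4):
    """Naive divisive clustering: repeatedly split the largest cluster in two."""
    clusters = [np.arange(len(points))]                   # start with all points in one cluster
    while len(clusters) < max_clusters:
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))  # pick the largest cluster
        members = clusters.pop(idx)
        if len(members) < 2:                              # nothing left to split
            clusters.append(members)
            break
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(points[members])
        clusters.append(members[labels == 0])             # first sub-cluster
        clusters.append(members[labels == 1])             # second sub-cluster
    return clusters
```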
When to use each approach
Bottom-Up (Agglomerative): Suitable for small to medium-sized datasets, especially when computational resources are limited. It is easier to implement and can handle various types of distance metrics.
Top-Down (Divisive): May be preferred for larger datasets or when a global perspective of the data structure is needed. It can be more accurate in some cases but requires more sophisticated methods and computational power.
Dendrogram
Both methods represent clusters with a dendrogram, which is a tree-like diagram that illustrates the arrangement of the clusters produced by hierarchical clustering. It shows the hierarchical relationship between the clusters, from individual data points at the leaves (bottom) to a single cluster at the root (top). The components of a dendrogram are:
Leaves (Terminal Nodes): Represent individual data points or objects. Each leaf is usually labeled with the corresponding data point.
Branches (Edges): Connect the leaves and internal nodes. The length of the branches often reflects the distance or dissimilarity between clusters.
Internal Nodes: Represent clusters formed by merging or splitting. Each internal node signifies a cluster that contains all the data points below it.
Height: Indicates the distance or dissimilarity at which clusters are merged or split. The vertical position of a node indicates the distance at which clusters are joined.
Coding bottom-up (agglomerative) hierarchical clustering from scratch
In this blog post we will code bottom-up (agglomerative) hierarchical clustering in Python. The process is straightforward: we repeatedly merge data points and clusters based on a selected distance metric and linkage criterion until all data points are grouped into one single cluster.
Before clustering, a distance metric and a linkage criterion need to be chosen. A distance metric is a measure of the similarity or dissimilarity between data points. It quantifies how close or far apart two data points (or clusters of data points) are in the feature space.
The most commonly used distance metrics are:
Euclidean distance
Manhattan distance
Cosine distance. The cosine distance between two vectors v and u is given by the following formula:
d(v, u) = 1 − (v ⋅ u) / (‖v‖2 ‖u‖2)
where ‖*‖2 is the 2-norm of its argument *, and v ⋅ u is the dot product of v and u.
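As a reference, here is a minimal sketch of these three distance metrics with NumPy (the function names are my own):

```python
import numpy as np

def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return np.sqrt(np.sum((u - v) ** 2))

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences."""
    return np.sum(np.abs(u - v))

def cosine_distance(u, v):
    """One minus the cosine of the angle between the two vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```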
Once the distance metric has been chosen, the next step is choosing the linkage criterion (aka linkage method). Linkage refers to the method used to determine the distance between clusters. Linkage criteria define how the distance between two clusters is calculated from the distances between the points within those clusters. Different linkage methods can produce different cluster formations. The most common types of linkage criteria are:
Single Linkage (also known as minimum linkage or the nearest neighbor method):
The distance between two clusters is defined as the minimum distance between any single point in the first cluster and any single point in the second cluster.
It tends to produce elongated, "chained" clusters and works well for datasets where the clusters are well separated. Because it is sensitive to outliers, it performs poorly with overlapping or close clusters.
Complete Linkage (also known as maximum linkage or the furthest neighbor method):
The distance between two clusters is defined as the maximum distance between any single point in the first cluster and any single point in the second cluster.
It tends to produce more compact and spherical clusters.
It is less sensitive to noise and outliers compared to single linkage, but it can struggle when clusters have elongated or non-convex shapes, or when clusters have unequal sizes (for example, one cluster is small and the other is large).
Average Linkage (also known as unweighted pair group method with arithmetic mean - UPGMA):
The distance between two clusters is defined as the average distance between all pairs of points, where each pair consists of one point from each cluster.
It produces clusters that are a compromise between single and complete linkage.
Centroid Linkage:
The distance between two clusters is defined as the distance between their centroids (mean points).
Ward's Linkage (aka Ward's minimum variance method):
It is both a distance metric and a linkage method.
The distance between two clusters is defined as the increase in the total within-cluster variance after merging the two clusters. Within-cluster variance (aka within-cluster sum of squares - WCSS) is a measure used to quantify the variability of data points within individual clusters. It provides an indication of how tightly the data points within a cluster are grouped together. Ward's linkage merges the pair of clusters that minimizes the increase in total within-cluster variance. The formula to calculate within-cluster variance is:
WCSS = Σk Σ(xi ∈ Ck) ||xi - μk||²
where ||xi - μk|| is the Euclidean distance between the data point xi and the centroid μk of cluster Ck.
It tends to produce compact and spherical clusters.
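Before moving to the implementation, here is a small sketch of how these linkage criteria could be computed between two sets of points with NumPy and SciPy (the function name is illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def linkage_distance(cluster_a, cluster_b, method="single"):
    """Distance between two clusters given as arrays of shape (n, d) and (m, d)."""
    pairwise = cdist(cluster_a, cluster_b)          # all point-to-point Euclidean distances
    if method == "single":                          # nearest neighbor
        return pairwise.min()
    if method == "complete":                        # furthest neighbor
        return pairwise.max()
    if method == "average":                         # UPGMA
        return pairwise.mean()
    if method == "centroid":                        # distance between mean points
        return np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))
    if method == "ward":                            # increase in total within-cluster variance
        na, nb = len(cluster_a), len(cluster_b)
        mu_a, mu_b = cluster_a.mean(axis=0), cluster_b.mean(axis=0)
        return (na * nb) / (na + nb) * np.sum((mu_a - mu_b) ** 2)
    raise ValueError(f"unknown linkage method: {method}")
```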
Implementation and results
In this implementation of hierarchical clustering, data points and clusters are represented by two different classes, named BaseCluster and MixCluster, respectively. The use of two classes makes the implementation easier and more intuitive (in my opinion), but it is not strictly necessary.
The BaseCluster class is designed to encapsulate the attributes and behavior of a single data point that will be part of a hierarchical cluster:
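The exact attributes live in the linked code; a minimal sketch of what such a class might look like (the attribute and method names other than plotline_yaxis_end_line are my own assumptions) is:

```python
import numpy as np

class BaseCluster:
    """A single data point treated as a leaf cluster (illustrative sketch)."""

    def __init__(self, point, label):
        self.point = np.asarray(point)      # coordinates of the data point
        self.label = label                  # label shown at the dendrogram leaf
        self.plotline_xaxis = None          # x-coordinate assigned later for plotting
        self.plotline_yaxis_end_line = 0.0  # leaves sit at height 0 in the dendrogram

    def get_points(self):
        """All points contained in this cluster (just one for a leaf)."""
        return self.point.reshape(1, -1)
```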
The MixCluster class is designed to encapsulate the attributes and behavior of a cluster formed by merging two existing clusters. These existing clusters can be any combination of BaseCluster and MixCluster. Note that after merging the attribute plotline_yaxis_end_line (where the line along the y axis ends in the dendrogram) for each cluster is set to the distance between the two clusters.
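Again, a rough sketch of how such a class could be organized (attribute and method names other than plotline_yaxis_end_line are assumptions):

```python
import numpy as np

class MixCluster:
    """A cluster formed by merging two existing clusters (illustrative sketch)."""

    def __init__(self, left, right, distance):
        self.left = left                    # first merged cluster (BaseCluster or MixCluster)
        self.right = right                  # second merged cluster
        self.distance = distance            # distance at which the merge happened
        self.plotline_xaxis = None          # x-coordinate assigned later for plotting
        self.plotline_yaxis_end_line = 0.0  # filled in when this cluster is itself merged
        # the merged children's vertical lines end at the merge distance
        left.plotline_yaxis_end_line = distance
        right.plotline_yaxis_end_line = distance

    def get_points(self):
        """All points contained in this cluster, stacked into one array."""
        return np.vstack([self.left.get_points(), self.right.get_points()])
```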
The function calculate_distance calculates the distance between two clusters using various distance metrics and linkage methods.
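A condensed sketch of what calculate_distance might look like, built on the get_points helper assumed above (the real function in the linked code supports more combinations):

```python
from scipy.spatial.distance import cdist

def calculate_distance(cluster_a, cluster_b, metric="euclidean", linkage="single"):
    """Distance between two clusters under a given metric and linkage (sketch)."""
    pairwise = cdist(cluster_a.get_points(), cluster_b.get_points(), metric=metric)
    if linkage == "single":
        return pairwise.min()
    if linkage == "complete":
        return pairwise.max()
    if linkage == "average":
        return pairwise.mean()
    raise ValueError(f"unsupported linkage: {linkage}")
```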
The function generate_distance_matrix generates a distance matrix for a list of clusters. It iterates over each pair of clusters (r, c) in the distance matrix using nested loops. The diagonal elements (r = c) are set to a very large number (1e10). This indicates that the distance between a cluster and itself is effectively infinite, preventing self-merging in clustering algorithms.
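A possible sketch of generate_distance_matrix, reusing the calculate_distance sketch above:

```python
import numpy as np

def generate_distance_matrix(clusters, metric="euclidean", linkage="single"):
    """Pairwise distance matrix between all current clusters (sketch)."""
    n = len(clusters)
    distances = np.zeros((n, n))
    for r in range(n):
        for c in range(n):
            if r == c:
                distances[r, c] = 1e10   # block self-merging with a huge distance
            else:
                distances[r, c] = calculate_distance(clusters[r], clusters[c],
                                                     metric=metric, linkage=linkage)
    return distances
```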
Clustering is achieved with the function clustering. It performs hierarchical clustering on a dataset and returns a single MixCluster that represents the hierarchical clustering of all data points. It continuously performs the following steps until only one cluster remains: 1- generate distance matrix, 2- find closest clusters, 3- get indices of closest clusters, 4- merge clusters, 5- update cluster list, 6- check for completion (is only one cluster left?).
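Putting the pieces together, the main loop might look roughly like this (again a sketch built on the classes and helpers assumed above):

```python
import numpy as np

def clustering(data, metric="euclidean", linkage="single"):
    """Agglomerative clustering of a 2D data array; returns the root MixCluster (sketch)."""
    clusters = [BaseCluster(point, label=i) for i, point in enumerate(data)]
    while len(clusters) > 1:
        distances = generate_distance_matrix(clusters, metric=metric, linkage=linkage)
        r, c = np.unravel_index(np.argmin(distances), distances.shape)   # closest pair
        merged = MixCluster(clusters[r], clusters[c], distances[r, c])   # merge them
        clusters = [cl for i, cl in enumerate(clusters) if i not in (r, c)]
        clusters.append(merged)                                          # update the cluster list
    return clusters[0]
```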
Two helper functions are needed to plot the dendrogram. The first one, complete_xcoordinates, assigns x-coordinates to clusters for plotting a dendrogram. This is achieved by recursively determining the x-coordinates for each cluster based on their hierarchical relationships and the order of data points.
The second helper function, plot_dendrogram, visualizes the hierarchical structure of the clusters as a dendrogram and also plots the original data points for reference.
Now we have coded everything we need to cluster some simulated data. We generated 15 data points in a 2-dimensional space and clustered them. The resulting dendrogram, obtained using Euclidean distance and single linkage, is shown below:
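Assuming the sketches above (the exact data generation in the linked code may differ), running the pipeline on simulated data could look like this:

```python
import numpy as np

np.random.seed(42)                                   # reproducible simulated data
data = np.random.rand(15, 2) * 10                    # 15 points in a 2-dimensional space

root = clustering(data, metric="euclidean", linkage="single")
# plot_dendrogram(root, data)  # visualize the hierarchy with the helper described above
```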
Next we can cluster the same data using the Ward method:
Even though the distance values are significantly different (the Ward method generates much larger distances), the clustering is very similar but not identical, which is expected when using different distance metrics and linkage methods.
Selecting the optimal number of clusters using the dendrogram
Once the dendrogram is created, the next step is to decide how many clusters the data contains. The most common method is as follows:
Identify the longest vertical lines in the dendrogram that are not interrupted by horizontal lines. These represent large distances at which clusters are joined, indicating the most significant merges.
Draw a horizontal line (cut-off line) in the middle of the longest vertical line identified in the previous step. The number of clusters is determined by the number of vertical lines the horizontal cut-off line intersects. Each intersection represents a separate cluster.
Using the dendrogram obtained with Euclidean distance and single linkage, we would have 3 clusters:
Another method to select the optimal number of clusters is the elbow method, typically used with K-means clustering. For each value of k (number of clusters), the sum of squared distances (the total within-cluster sum of squares, WCSS) is calculated. The optimal number of clusters is indicated by the "elbow" point in the plot of WCSS versus k (k on the x-axis and WCSS on the y-axis); the elbow point is where the rate of decrease in WCSS slows down significantly.
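As an illustration, a minimal sketch of the elbow method using scikit-learn's KMeans (its inertia_ attribute is the WCSS) could be:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

np.random.seed(42)
data = np.random.rand(15, 2) * 10                     # same kind of simulated data as above

ks = range(1, 10)
wcss = [KMeans(n_clusters=k, n_init=10).fit(data).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")                        # look for the "elbow" in this curve
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS")
plt.show()
```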
Conclusion
This blog post contains a comprehensive implementation of all steps involved in hierarchical clustering and dendrogram visualization. Thank you for your attention, subscribe and follow me on LinkedIn for more engaging content!