K Means Cluster Specification (partition)
Main Arguments:
num_clusters = 3
Computational engine: stats
There are currently two engines: stats::kmeans (default) and ClusterR::KMeans_rcpp.
It is also possible to change the algorithmic details of the implementation, by changing the engine and/or using the corresponding arguments from the engine functions:
Note that the stats::kmeans and ClusterR::KMeans_rcpp implementations have very different default settings for the algorithmic details, so it is recommended to be deliberate and explicit in choosing these options.
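As a rough illustration of what "explicit" can look like, the sketch below calls stats::kmeans directly on simulated data and spells out the algorithm, the number of random starts, and the iteration limit. The data and the object names (toy, fit_default, fit_lloyd) are made up for this example; in tidyclust the same arguments would be supplied through the engine rather than by calling stats::kmeans yourself.

```r
# Simulated two-column data; `toy` is a made-up stand-in for a real dataset.
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)

# Default settings: Hartigan-Wong algorithm, a single random start.
fit_default <- stats::kmeans(toy, centers = 3)

# Explicit settings: Lloyd's algorithm, 10 random starts, more iterations.
fit_lloyd <- stats::kmeans(
  toy,
  centers = 3,
  algorithm = "Lloyd",
  nstart = 10,
  iter.max = 50
)

# Both fits partition the same 60 rows into 3 clusters.
table(fit_lloyd$cluster)
```

Writing the settings out this way makes it easier to see which choices differ when switching between implementations.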
1.4 Fit the k-means model
Once specified, a model may be “fit” to a dataset by providing a formula and data frame in the same manner as a tidymodels model fit. Note that unlike in supervised modeling, the formula should not include a response variable.
        Length Class   Mode
spec    4      k_means list
fit     9      kmeans  list
elapsed 1      -none-  list
preproc 4      -none-  list
tidyclust also provides a function, extract_fit_summary(), to produce a list of model summary information in a format that is consistent across all cluster model specifications and engines.
Note that this function renames clusters in accordance with the standard tidyclust naming convention and ordering: clusters are named “Cluster_1”, “Cluster_2”, etc. and are numbered by the order they appear in the rows of the training dataset.
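The naming rule described above can be sketched in base R. This is only an illustration of the ordering idea, not tidyclust's internal code, and the raw label values are made up.

```r
# Raw engine labels in row order (made-up values for illustration).
raw_labels <- c(3, 3, 1, 2, 1, 3, 2)

# Number each cluster by the position of its first appearance in the data.
first_seen <- unique(raw_labels)
new_index <- match(raw_labels, first_seen)

# Build names in the "Cluster_<i>" style described above.
renamed <- factor(
  paste0("Cluster_", new_index),
  levels = paste0("Cluster_", seq_along(first_seen))
)
renamed
```

Here the engine's label 3 becomes Cluster_1 simply because it occurs in the first row.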
1.6 Centroids
A secondary output of interest is often the characterization of the clusters; i.e., what data feature trends does each cluster seem to represent? Most commonly, clusters are characterized by their mean values in the predictor space, a.k.a. the centroids.
Based on the above output, we might say that Cluster_1 is penguins with smaller bill lengths, Cluster_2 has smaller bill depths, and Cluster_3 is penguins with large bills in both dimensions.
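To make the idea of a centroid concrete, here is a minimal base R sketch on simulated data (not the penguins data) that computes the per-cluster column means by hand and compares them with the centers reported by stats::kmeans. The names toy, km, and manual_centroids are illustrative.

```r
set.seed(2)
toy <- data.frame(
  x = c(rnorm(25, 0), rnorm(25, 6)),
  y = c(rnorm(25, 0), rnorm(25, 6))
)
km <- stats::kmeans(toy, centers = 2, nstart = 5)

# Centroids computed by hand: the mean of each column within each cluster.
manual_centroids <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)

# These agree with the centers reported by the fit.
manual_centroids
km$centers
```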
1.7 Sum of squared error
One simple metric is the within-cluster sum of squared errors (WSS), which sums the squared distances from each observation to its cluster center. This is sometimes scaled by the total sum of squared errors (TSS), the sum of squared distances from all observations to the global centroid; in particular, the ratio WSS/TSS is often computed. In principle, small values of WSS or of the WSS/TSS ratio suggest that observations are closer (more similar) to the other members of their own cluster than to observations in other clusters.
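The following sketch computes WSS, TSS, and their ratio by hand in base R, using simulated data and ordinary Euclidean distance. The object names are illustrative; the hand-computed values can be checked against the tot.withinss and totss components that stats::kmeans returns.

```r
set.seed(3)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 4), ncol = 2)
)
km <- stats::kmeans(toy, centers = 2, nstart = 5)

# WSS: squared Euclidean distance from each row to its own cluster center.
own_center <- km$centers[km$cluster, , drop = FALSE]
wss <- sum((toy - own_center)^2)

# TSS: squared Euclidean distance from each row to the global centroid.
global_center <- colMeans(toy)
tss <- sum(sweep(toy, 2, global_center)^2)

c(wss = wss, tss = tss, ratio = wss / tss)
```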
The WSS and TSS come “for free” with the model fit summary, or they can be accessed directly from the model fit:
kmeans_summary$sse_within_total_total
[1] 944.4986 754.7437 617.9859
kmeans_summary$sse_total
[1] 11494.04
kmeans_fit %>% sse_within_total()
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sse_within_total standard 2317.
kmeans_fit %>% sse_total()
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sse_total standard 11494.
kmeans_fit %>% sse_ratio()
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sse_ratio standard 0.202
We can also see the within sum-of-squares by cluster, rather than totalled, with sse_within():
Another common measure of cluster structure is called the silhouette. The silhouette of a single observation is the average distance from that observation to the observations in the nearest other cluster, minus the average distance to the other observations in its own cluster, normalized by the greater of these two averages. In principle, a large silhouette (close to 1) suggests that an observation is more similar to those within its cluster than to those outside its cluster.
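A small hand computation may help here. The sketch below uses six made-up points on a line, with a fixed two-cluster assignment, and computes each silhouette as (b - a) / max(a, b), where a is the average distance to the other members of the observation's own cluster and b is the average distance to the nearest other cluster. It assumes every cluster has at least two members.

```r
# Six points on a line forming two obvious groups (made-up data).
x <- c(1, 2, 3, 10, 11, 12)
cluster <- c(1, 1, 1, 2, 2, 2)
d <- as.matrix(dist(x))

silhouette_one <- function(i) {
  same <- cluster == cluster[i]
  same[i] <- FALSE  # exclude the observation itself
  a <- mean(d[i, same])  # average distance within its own cluster
  # average distance to each other cluster; keep the nearest one
  others <- setdiff(unique(cluster), cluster[i])
  b <- min(sapply(others, function(k) mean(d[i, cluster == k])))
  (b - a) / max(a, b)
}

sil <- sapply(seq_along(x), silhouette_one)
round(sil, 3)
mean(sil)
```

For the first point, a = 1.5 and b = 10, so its silhouette is 8.5 / 10 = 0.85.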
We can average all silhouettes to get a metric for the full clustering fit. Because the computation of the silhouette depends on the original observation values, a dataset must also be supplied to the function:
kmeans_fit %>% silhouette_avg(penguins)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 silhouette_avg standard 0.488
1.9 Changing distance measures
These metrics all depend on measuring the distance between points and/or centroids. By default, ordinary Euclidean distance is used. However, it is possible to select a different distance function.
For sum of squares metrics, the distance function supplied must take two arguments (i.e., the observation locations and the centroid locations). For the silhouette metric, the distance function must find pairwise distances from a single matrix (i.e., all pairwise distances between observations).
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 silhouette_avg standard 0.494
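The two function shapes described above can be sketched in base R using Manhattan distance. The function names manhattan_pairwise and manhattan_cross are hypothetical helpers written for this example; how such functions are passed to the tidyclust metric functions is not shown in this section, so check the metric documentation for the relevant argument.

```r
# Pairwise form: one matrix in, an n x n matrix of Manhattan distances out.
manhattan_pairwise <- function(x) {
  as.matrix(stats::dist(x, method = "manhattan"))
}

# Two-argument form: rows of `x` (observations) against rows of `y`
# (centroids), returning an nrow(x) x nrow(y) matrix.
manhattan_cross <- function(x, y) {
  out <- matrix(NA_real_, nrow = nrow(x), ncol = nrow(y))
  for (j in seq_len(nrow(y))) {
    out[, j] <- rowSums(abs(sweep(x, 2, y[j, ])))
  }
  out
}

obs <- matrix(c(0, 0,
                1, 2,
                4, 4), ncol = 2, byrow = TRUE)
centers <- matrix(c(0, 1,
                    4, 3), ncol = 2, byrow = TRUE)

manhattan_pairwise(obs)
manhattan_cross(obs, centers)
```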
1.10 Workflows
The workflow structure of tidymodels is also usable with tidyclust objects. In the following example, we try two recipes for clustering penguins by bill dimensions. In the second recipe, we log-scale both predictors before clustering.
If you have not yet read the k_means vignette, we recommend reading that first; functions that are used in this vignette are explained in more detail there.
2.1 A brief introduction to hierarchical clustering
Hierarchical Clustering, sometimes called Agglomerative Clustering, is a method of unsupervised learning that produces a dendrogram, which can be used to partition observations into clusters.
The hierarchical clustering process begins with each observation in its own cluster; i.e., n clusters for n observations.
The closest two observations are then joined together into a single cluster.
This process continues, with the closest two clusters being joined (or “agglomerated”) at each step.
The result of the process is a dendrogram, which shows the joining of clusters in tree form:
2.1.1 Clusters from dendrogram
To produce a partition-style cluster assignment from the dendrogram, one must “cut” the tree at a chosen height:
The observations that remain joined in the dendrogram below the cut height are considered to be in a cluster together:
# A tibble: 5 × 2
observation cluster_assignment
<chr> <int>
1 a 1
2 b 1
3 c 2
4 d 2
5 e 2
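Cutting a tree at a chosen height can be tried directly with stats::hclust and stats::cutree. The sketch below uses five made-up labelled points on a line (not the data behind the figure above) and average linkage.

```r
# Five labelled points on a line (made-up data, not the vignette's).
pts <- c(a = 1, b = 2, c = 7, d = 8, e = 9.5)
tree <- stats::hclust(dist(pts), method = "average")

# Merge heights, in the order the joins happened.
tree$height

# Cut at height 3: only joins made below that height are kept.
stats::cutree(tree, h = 3)

# Equivalent here: ask for two clusters directly.
stats::cutree(tree, k = 2)
```

With these points the last join happens at a height well above 3, so cutting at 3 leaves a and b in one cluster and c, d, and e in the other.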
2.1.2 Methods of agglomeration
At every step of the agglomeration, we measure distances between current clusters. With each cluster containing (possibly) multiple points, what does it mean to measure distance?
There are five common approaches to cluster-cluster distancing, also known as “linkage”:
single linkage: The distance between two clusters is the distance between the two closest observations.
average linkage: The distance between two clusters is the average of all distances between observations in one cluster and observations in the other.
complete linkage: The distance between two clusters is the distance between the two furthest observations.
centroid method: The distance between two clusters is the distance between their centroids (the mean or median of the observations in each cluster).
Ward’s method: The distance between two clusters is proportional to the increase in error sum of squares (ESS) that would result from joining them. The ESS is computed as the sum of squared distances between the observations in a cluster and the centroid of that cluster.
It is also worth mentioning the McQuitty method, which retains information about previously joined clusters to measure future linkage distance. This method is currently supported for model fitting, but not for prediction, in tidyclust.
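The linkage definitions above can be checked by hand for a single pair of clusters. The sketch below uses two tiny made-up clusters and Euclidean distance; it illustrates one cluster-to-cluster comparison, not the full agglomeration procedure, and the ess helper is written for this example.

```r
# Two small clusters of points in the plane (made-up data).
A <- matrix(c(0, 0,
              1, 0), ncol = 2, byrow = TRUE)
B <- matrix(c(4, 0,
              6, 0), ncol = 2, byrow = TRUE)

# All Euclidean distances between a point in A and a point in B.
cross <- as.matrix(dist(rbind(A, B)))[1:2, 3:4]

single_link   <- min(cross)   # closest pair
complete_link <- max(cross)   # furthest pair
average_link  <- mean(cross)  # mean over all pairs
centroid_link <- sqrt(sum((colMeans(A) - colMeans(B))^2))

# Ward: increase in error sum of squares (ESS) if A and B were merged.
ess <- function(m) sum(sweep(m, 2, colMeans(m))^2)
ward_increase <- ess(rbind(A, B)) - ess(A) - ess(B)

c(single = single_link, average = average_link, complete = complete_link,
  centroid = centroid_link, ward_increase = ward_increase)
```

Single linkage can never exceed average linkage, which in turn can never exceed complete linkage, since they are the minimum, mean, and maximum of the same set of distances.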
2.2 hier_clust specification in {tidyclust}
To specify a hierarchical clustering model in tidyclust, simply choose a value of num_clusters and (optionally) a linkage method:
        Length Class      Mode
spec    4      hier_clust list
fit     7      hclust     list
elapsed 1      -none-     list
preproc 4      -none-     list
To produce a dendrogram plot, access the engine fit (although, as we see below, dendrograms are often not very informative for moderate to large datasets):
hc_fit$fit %>% plot()
We can also extract the standard tidyclust summary list:
Note that, although the hierarchical clustering algorithm is not focused on cluster centroids in the same way \(k\)-means is, we are still able to compute the mean of the predictors for each cluster:
To predict the cluster assignment for a new observation, we find the closest cluster. How we measure “closeness” is dependent on the specified type of linkage in the model:
single linkage: The new observation is assigned to the same cluster as its nearest observation from the training data.
complete linkage: The new observation is assigned to the cluster with the smallest maximum distance between its training observations and the new observation.
average linkage: The new observation is assigned to the cluster with the smallest average distance between its training observations and the new observation.
centroid method: The new observation is assigned to the cluster with the closest centroid, as in prediction for k_means.
Ward’s method: The new observation is assigned to the cluster with the smallest increase in error sum of squares (ESS) due to the new addition. The ESS is computed as the sum of squared distances between the observations in a cluster and the centroid of that cluster.
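The rules in the list above can give different answers for the same new observation. The base R sketch below applies the single, complete, average, and centroid rules to made-up training data with fixed cluster labels; it illustrates the rules as described here and is not tidyclust's predict() implementation. The helper pick is hypothetical.

```r
# Training points with known cluster labels (made-up data).
train <- matrix(c(0, 0,
                  1, 0,
                  5, 0,
                  9, 0), ncol = 2, byrow = TRUE)
labels <- c(1, 1, 2, 2)
new_obs <- c(3.5, 0)

# Euclidean distance from the new observation to every training row.
d_new <- sqrt(rowSums(sweep(train, 2, new_obs)^2))

# Summarise distances per cluster, then pick the cluster with the smallest value.
pick <- function(summary_fun) {
  per_cluster <- tapply(d_new, labels, summary_fun)
  as.integer(names(per_cluster)[which.min(per_cluster)])
}

# Centroid rule: distance from the new observation to each cluster mean.
centroids <- rowsum(train, labels) / as.vector(table(labels))
d_centroid <- sqrt(rowSums(sweep(centroids, 2, new_obs)^2))
centroid_pick <- as.integer(rownames(centroids)[which.min(d_centroid)])

assigned <- c(single = pick(min), complete = pick(max),
              average = pick(mean), centroid = centroid_pick)
assigned
```

In this example the nearest single training point belongs to cluster 2, while the complete, average, and centroid rules all favour cluster 1.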
It’s important to note that there is no guarantee that predict() on the training data will produce the same results as extract_cluster_assignment(). The process by which clusters are created during the agglomeration results in a particular partition; but if a training observation is treated as new data, it is predicted in the same manner as truly new information.
Suppose we have produced cluster assignments from two models: a hierarchical clustering model with three clusters (as above) and a \(k\)-means clustering model with five clusters (below). How can we combine these assignments?
However, they are not fully unrelated assignments. For example, all of KM_2 in the \(k\)-means assignment fell inside HC_1 for the hierarchical assignments.
Our goal is to relabel the five \(k\)-means clusters to match the three cluster names in the hierarchical output. This can be accomplished with reconcile_clusterings_mapping().
This function expects two vectors of cluster labels as input. The first is the label that will be matched, and the second is the label that will be recoded to the first.
If we are not trying to simply match names across two same-size clusterings, the option one_to_one must be set to FALSE.
In this example, we can see that KM_1, KM_2, KM_5 have been matched to HC_1; and KM_3 and KM_4 have been matched to HC_2. Notice that no clusters from the KM set were matched to HC_3; evidently, this is a small cluster that did not manifest clearly in the \(k\)-means clustering.
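The idea of a many-to-one relabeling can be sketched in base R with a cross-tabulation. This is a simple majority-overlap rule on made-up labels, offered only to illustrate the concept; it is not the algorithm used by reconcile_clusterings_mapping(), and ties here are broken by taking the first column with the largest count.

```r
# Made-up label vectors for eight observations.
hc <- c("HC_1", "HC_1", "HC_1", "HC_2", "HC_2", "HC_2", "HC_3", "HC_1")
km <- c("KM_1", "KM_1", "KM_2", "KM_3", "KM_3", "KM_4", "KM_4", "KM_5")

# Cross-tabulate: rows are k-means labels, columns are hierarchical labels.
counts <- table(km, hc)

# Map each k-means label to the hierarchical label it overlaps with most.
mapping <- colnames(counts)[apply(counts, 1, which.max)]
names(mapping) <- rownames(counts)
mapping

# Recode the k-means assignments using that mapping.
recoded <- unname(mapping[km])
recoded
```

As in the example above, several k-means labels collapse onto the same hierarchical label, and a small hierarchical cluster can end up with no k-means cluster mapped to it.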