Clustering Analysis Workflow

Biostat 212A

Author

Dr. Jin Zhou @ UCLA

Published

March 17, 2026

Display system information for reproducibility.

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.7.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.5.2    fastmap_1.2.0     cli_3.6.5        
 [5] tools_4.5.2       htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10      
 [9] rmarkdown_2.29    knitr_1.50        jsonlite_2.0.0    xfun_0.56        
[13] digest_0.6.37     rlang_1.1.6       evaluate_1.0.5   
library(workflows)
library(parsnip)
library(tidyclust)
library(tidyverse)
library(tidymodels)
library(RcppHungarian)
set.seed(838383)

1 K-means clustering

1.1 Acknoledgement

This document is adapted from the tidyclust k_means package vignette.

1.2 Load and clean a dataset: penguins

data("penguins", package = "modeldata")

penguins <- penguins %>%
  select(bill_length_mm, bill_depth_mm) %>%
  drop_na()


# shuffle rows
penguins <- penguins %>%
  sample_n(nrow(penguins))

1.3 k-means specification in {tidyclust}

kmeans_spec <- k_means(num_clusters = 3)

kmeans_spec
K Means Cluster Specification (partition)

Main Arguments:
  num_clusters = 3

Computational engine: stats 

There are currently two engines: stats::kmeans (default) and ClusterR::KMeans_rcpp.

It is also possible to change the algorithmic details of the implementation, by changing the engine and/or using the corresponding arguments from the engine functions:

kmeans_spec_lloyd <- k_means(num_clusters = 3) %>%
  parsnip::set_engine("stats", algorithm = "Lloyd")

kmeans_spec_cr <- k_means(num_clusters = 3) %>%
  parsnip::set_engine("ClusterR", initializer = "random")

Note that the stats::kmeans and the ClusterR::KMeans_rcpp implementations have very different default settings for the algorithmic details, so it is recommended to be deliberate and explicit in choosing these options, check here.

1.4 Fit the k-means model

Once specified, a model may be “fit” to a dataset by providing a formula and data frame in the same manner as a tidymodels model fit. Note that unlike in supervised modeling, the formula should not include a response variable.

kmeans_fit <- kmeans_spec %>%
  fit(~ bill_length_mm + bill_depth_mm,
    data = penguins
  )

kmeans_fit %>%
  summary()
        Length Class   Mode
spec    4      k_means list
fit     9      kmeans  list
elapsed 1      -none-  list
preproc 4      -none-  list

tidyclust also provides a function, extract_fit_summary(), to produce a list of model summary information in a format that is consistent across all cluster model specifications and engines

kmeans_summary <- kmeans_fit %>%
  extract_fit_summary()

kmeans_summary %>% str()
List of 7
 $ cluster_names         : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 2 3
 $ centroids             : tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ bill_length_mm: num [1:3] 38.4 45.5 50.9
  ..$ bill_depth_mm : num [1:3] 18.3 15.6 17.3
 $ n_members             : int [1:3] 141 116 85
 $ sse_within_total_total: num [1:3] 944 755 618
 $ sse_total             : num 11494
 $ orig_labels           : int [1:342] 1 1 1 2 3 3 2 2 1 3 ...
 $ cluster_assignments   : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 1 1 2 3 3 2 2 1 3 ...

1.5 Cluster assignments and centers

kmeans_fit %>%
  extract_cluster_assignment()
# A tibble: 342 × 1
   .cluster 
   <fct>    
 1 Cluster_1
 2 Cluster_1
 3 Cluster_1
 4 Cluster_2
 5 Cluster_3
 6 Cluster_3
 7 Cluster_2
 8 Cluster_2
 9 Cluster_1
10 Cluster_3
# ℹ 332 more rows

Note that this function renames clusters in accordance with the standard tidyclust naming convention and ordering: clusters are named “Cluster_1”, “Cluster_2”, etc. and are numbered by the order they appear in the rows of the training dataset.

1.6 Centroids

A secondary output of interest is often the characterization of the clusters; i.e., what data feature trends does each cluster seem to represent? Most commonly, clusters are characterized by their mean values in the predictor space, a.k.a. the centroids.

These can be accessed from the full summary:

kmeans_summary$centroids
# A tibble: 3 × 2
  bill_length_mm bill_depth_mm
           <dbl>         <dbl>
1           38.4          18.3
2           45.5          15.6
3           50.9          17.3

They can also be accessed directly from the fitted model with:

kmeans_fit %>%
  extract_centroids()
# A tibble: 3 × 3
  .cluster  bill_length_mm bill_depth_mm
  <fct>              <dbl>         <dbl>
1 Cluster_1           38.4          18.3
2 Cluster_2           45.5          15.6
3 Cluster_3           50.9          17.3

Based on the above output, we might say that Cluster_1 is penguins with smaller bill lengths, Cluster_2 has smaller bill depths, and Cluster_3 is penguins with large bills in both dimensions.

1.7 Sum of squared error

One simple metric is the within cluster sum-of-squared error (WSS), which measures the sum of all distances from observations to their cluster center. This is sometimes scaled with the total sum-of-squared error (TSS), the distance from all observations to the global centroid; in particular, the ratio WSS/TSS is often computed. In principle, small values of WSS or of the WSS/TSS ratio suggest that the observations within clusters are closer (more similar) to each other than they are to the other clusters.

The WSS and TSS come “for free” with the model fit summary, or they can be accessed directly from the model fit:

kmeans_summary$sse_within_total_total
[1] 944.4986 754.7437 617.9859
kmeans_summary$sse_total
[1] 11494.04
kmeans_fit %>% sse_within_total()
# A tibble: 1 × 3
  .metric          .estimator .estimate
  <chr>            <chr>          <dbl>
1 sse_within_total standard       2317.
kmeans_fit %>% sse_total()
# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 sse_total standard      11494.
kmeans_fit %>% sse_ratio()
# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 sse_ratio standard       0.202

We can also see the within sum-of-squares by cluster, rather than totalled, with sse_within():

kmeans_fit %>%
  sse_within()
# A tibble: 3 × 3
  .cluster    wss n_members
  <fct>     <dbl>     <int>
1 Cluster_1  944.       141
2 Cluster_2  755.       116
3 Cluster_3  618.        85

1.8 Silhouette

Another common measure of cluster structure is called the silhouette. The silhouette of a single observation is proportional to the average distance from that observation to within-cluster observations minus the average distance to outside-cluster observations; normalized by the greater of these two average. In principle, a large silhouette (close to 1) suggests that an observation is more similar to those within its cluster than those outside its cluster.

We can average all silhouettes to get a metric for the full clustering fit. Because the computation of the silhouette depends on the original observation values, a dataset must also be supplied to the function

kmeans_fit %>%
  silhouette_avg(penguins)
# A tibble: 1 × 3
  .metric        .estimator .estimate
  <chr>          <chr>          <dbl>
1 silhouette_avg standard       0.488

1.9 Changing distance measures

These metrics all depend on measuring the distance between points and/or centroids. By default, ordinary Euclidean distance is used. However, it is possible to select a different distance function.

For sum of squares metrics, the distance function supplied must take two arguments (i.e., the observation locations and the centroid locations). For the sihouette metric, the distance function must find pairwise distances from a single matrix (i.e., all pairwise distances between observations).

my_dist_1 <- function(x) {
  Rfast::Dist(x, method = "manhattan")
}

my_dist_2 <- function(x, y) {
  Rfast::dista(x, y, method = "manhattan")
}

kmeans_fit %>% sse_ratio(dist_fun = my_dist_2)
# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 sse_ratio standard       0.202
kmeans_fit %>% silhouette_avg(penguins, dist_fun = my_dist_1)
# A tibble: 1 × 3
  .metric        .estimator .estimate
  <chr>          <chr>          <dbl>
1 silhouette_avg standard       0.494

1.10 Workflows

The workflow structure of tidymodels is also usable with tidyclust objects. In the following example, we try two recipes for clustering penguins by bill dimensions. In the second recipe, we log-scale both predictors before clustering.

penguins_recipe_1 <- recipe(~ bill_length_mm + bill_depth_mm,
  data = penguins
)

penguins_recipe_2 <- recipe(~ bill_length_mm + bill_depth_mm,
  data = penguins
) %>%
  step_log(all_numeric_predictors())

wflow_1 <- workflow() %>%
  add_model(kmeans_spec) %>%
  add_recipe(penguins_recipe_1)

wflow_2 <- workflow() %>%
  add_model(kmeans_spec) %>%
  add_recipe(penguins_recipe_2)

wflow_1 %>%
  fit(penguins) %>%
  extract_centroids()
# A tibble: 3 × 3
  .cluster  bill_length_mm bill_depth_mm
  <fct>              <dbl>         <dbl>
1 Cluster_1           38.4          18.3
2 Cluster_2           45.5          15.6
3 Cluster_3           50.9          17.3
wflow_2 %>%
  fit(penguins) %>%
  extract_centroids()
# A tibble: 3 × 3
  .cluster  bill_length_mm bill_depth_mm
  <fct>              <dbl>         <dbl>
1 Cluster_1           3.65          2.90
2 Cluster_2           3.90          2.92
3 Cluster_3           3.85          2.70

2 Hierarchical clustering

Load and clean a dataset:

data("penguins", package = "modeldata")

penguins <- penguins %>%
  select(bill_length_mm, bill_depth_mm) %>%
  drop_na()

# shuffle rows
penguins <- penguins %>%
  sample_n(nrow(penguins))

If you have not yet read the k_means vignette, we recommend reading that first; functions that are used in this vignette are explained in more detail there.

2.1 A brief introduction to hierarchical clustering

Hierarchical Clustering, sometimes called Agglomerative Clustering, is a method of unsupervised learning that produces a dendrogram, which can be used to partition observations into clusters.

The hierarchical clustering process begins with each observation in it’s own cluster; i.e., n clusters for n observations.

scatter chart. 5 circles are randomly located, and labeled a, b, c, d, and e.

The closest two observations are then joined together into a single cluster.

scatter chart. 5 circles are randomly located, and labeled a, b, c, d, and e. One of the circles are replacing 2 of the previous circles.

This process continues, with the closest two clusters being joined (or “aggolermated”) at each step.

scatter chart. 5 circles are randomly located, and labeled a, b, c, d, and e. One of the circles are replacing 2 of the previous circles.

The result of the process is a dendrogram, which shows the joining of clusters in tree form:

Warning in dist(fake_dat): NAs introduced by coercion

Dendrogram chart. With 5 observations.

2.1.1 Clusters from dendrogram

To produce a partition-style cluster assignment from the dendrogram, one must “cut” the tree at a chosen height:

Dendrogram chart. With 5 observations. A horizontal like at 0.6 cuts the dendrogram into 3 clusters.

The observations that remain joined in the dendrogram below the cut height are considered to be in a cluster together:

# A tibble: 5 × 2
  observation cluster_assignment
  <chr>                    <int>
1 a                            1
2 b                            1
3 c                            2
4 d                            2
5 e                            2

2.1.2 Methods of aggolmeration

At every step of the agglomeration, we measure distances between current clusters. With each cluster containing (possibly) multiple points, what does it mean to measure distance?

There are four common approaches to cluster-cluster distancing, aka “linkage”:

  1. single linkage: The distance between two clusters is the distance between the two closest observations.

  2. average linkage: The distance between two clusters is the average of all distances between observations in one cluster and observations in the other.

  3. complete linkage: The distance between two clusters is the distance between the two furthest observations.

  4. centroid method: The distance between two clusters is the distance between their centroids (geometric mean or median).

  5. Ward’s method: The distance between two clusters is proportional to the increase in error sum of squares (ESS) that would result from joining them. The ESS is computed as the sum of squared distances between observations in a cluster, and the centroid of the cluster.

It is also worth mentioning the McQuitty method, which retains information about previously joined clusters to measure future linkage distance. This method is currently supported for model fitting, but not for prediction, in tidyclust.

2.2 hier_clust specification in {tidyclust}

To specify a hierarchical clustering model in tidyclust, simply choose a value of num_clusters and (optionally) a linkage method:

hc_spec <- hier_clust(
  num_clusters = 3,
  linkage_method = "average"
)

hc_spec
Hierarchical Clustering Specification (partition)

Main Arguments:
  num_clusters = 3
  linkage_method = average

Computational engine: stats 

Currently, the only supported engine is stats::hclust(). The default linkage

2.3 Fitting hier_clust models

We fit the model to data in the usual way:

hc_fit <- hc_spec %>%
  fit(~ bill_length_mm + bill_depth_mm,
    data = penguins
  )

hc_fit %>%
  summary()
        Length Class      Mode
spec    4      hier_clust list
fit     7      hclust     list
elapsed 1      -none-     list
preproc 4      -none-     list

To produce a dendrogram plot, access the engine fit: (Although as we see below, dendrograms are often not very informative for moderate to large size datasets.)

hc_fit$fit %>% plot()

Dendrogram chart. With too many observations to be able to clearly see anything

We can also extract the standard tidyclust summary list:

hc_summary <- hc_fit %>% extract_fit_summary()

hc_summary %>% str()
List of 7
 $ cluster_names         : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 2 3
 $ centroids             : tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ bill_length_mm: num [1:3] 38.8 47.9 56.6
  ..$ bill_depth_mm : num [1:3] 18.3 16.2 16.7
 $ n_members             : int [1:3] 153 184 5
 $ sse_within_total_total: num [1:3] 378.4 573.9 9.7
 $ sse_total             : num 1803
 $ orig_labels           : NULL
 $ cluster_assignments   : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 2 1 2 1 1 1 3 2 1 ...

Note that, although the hierarchical clustering algorithm is not focused on cluster centroids in the same way \(k\)-means is, we are still able to compute the geometric mean over the predictors for each cluster:

hc_fit %>% extract_centroids()
# A tibble: 3 × 3
  .cluster  bill_length_mm bill_depth_mm
  <fct>              <dbl>         <dbl>
1 Cluster_1           38.8          18.3
2 Cluster_2           47.9          16.2
3 Cluster_3           56.6          16.7

2.4 Prediction

To predict the cluster assignment for a new observation, we find the closest cluster. How we measure “closeness” is dependent on the specified type of linkage in the model:

  • single linkage: The new observation is assigned to the same cluster as its nearest observation from the training data.

  • complete linkage: The new observation is assigned to the cluster with the smallest maximum distances between training observations and the new observation.

  • average linkage: The new observation is assigned to the cluster with the smallest average distances between training observations and the new observation.

  • centroid method: The new observation is assigned to the cluster with the closest centroid, as in prediction for k_means.

  • Ward’s method: The new observation is assigned to the cluster with the smallest increase in error sum of squares (ESS) due to the new addition. The ESS is computed as the sum of squared distances between observations in a cluster, and the centroid of the cluster.

hc_preds <- hc_fit %>% predict(penguins)

hc_preds
# A tibble: 342 × 1
   .pred_cluster
   <fct>        
 1 Cluster_1    
 2 Cluster_2    
 3 Cluster_1    
 4 Cluster_2    
 5 Cluster_1    
 6 Cluster_1    
 7 Cluster_1    
 8 Cluster_3    
 9 Cluster_2    
10 Cluster_1    
# ℹ 332 more rows

It’s important to note that there is no guarantee that predict() on the training data will produce the same results as extract_cluster_assignments(). The process by which clusters are created during the aggolmerations results in a particular partition; but if a training observation is treated as new data, it is predicted in the same manner as truly new information.

bind_cols(
  hc_preds,
  extract_cluster_assignment(hc_fit)
)
# A tibble: 342 × 2
   .pred_cluster .cluster 
   <fct>         <fct>    
 1 Cluster_1     Cluster_1
 2 Cluster_2     Cluster_2
 3 Cluster_1     Cluster_1
 4 Cluster_2     Cluster_2
 5 Cluster_1     Cluster_1
 6 Cluster_1     Cluster_1
 7 Cluster_1     Cluster_1
 8 Cluster_3     Cluster_3
 9 Cluster_2     Cluster_2
10 Cluster_1     Cluster_1
# ℹ 332 more rows

2.5 Reconciling partitions

Suppose we have produced cluster assignments from two models: a hierarchical clustering model with three clusters (as above) and a \(k\)-means clustering model with five clusters (below). How can we combine these assignments?

km_spec <- k_means(num_clusters = 5)
km_fit <- km_spec %>%
  fit(~., data = penguins)

km_preds <- predict(km_fit, penguins, prefix = "KM_")
hc_preds <- predict(hc_fit, penguins, prefix = "HC_")

We notice that the three-cluster assignments from hier_clust do not line up perfectly with the five-cluster assignments from k_means.

tibble(
  hc = hc_preds$.pred_cluster,
  km = km_preds$.pred_cluster
) %>%
  count(hc, km)
# A tibble: 8 × 3
  hc    km        n
  <fct> <fct> <int>
1 HC_1  KM_1     74
2 HC_1  KM_2      4
3 HC_1  KM_3     78
4 HC_2  KM_2     78
5 HC_2  KM_3      1
6 HC_2  KM_4     31
7 HC_2  KM_5     58
8 HC_3  KM_4     18

However, they are not fully unrelated assignments. For example, all of KM_2 in the \(k\)-means assignment fell inside HC_1 for the hierarchical assignments.

Our goal is to relabel the five \(k\)-means clusters to match the three cluster names in the hierarchical output. This can be accomplished with reconcile_clusterings_mapping().

This function expects two vectors of cluster labels as input. The first is the label that will be matched, and the second is the label that will be recoded to the first.

If we are not trying to simply match names across two same-size clusterings, the option one_to_one must be set to FALSE.

reconcile_clusterings_mapping(
  primary = hc_preds$.pred_cluster,
  alternative = km_preds$.pred_cluster,
  one_to_one = FALSE
)
# A tibble: 342 × 3
   primary alt   alt_recoded
   <fct>   <fct> <chr>      
 1 HC_1    KM_1  HC_1       
 2 HC_2    KM_2  HC_2       
 3 HC_1    KM_3  HC_1       
 4 HC_2    KM_2  HC_2       
 5 HC_1    KM_1  HC_1       
 6 HC_1    KM_3  HC_1       
 7 HC_1    KM_1  HC_1       
 8 HC_3    KM_4  HC_2       
 9 HC_2    KM_4  HC_2       
10 HC_1    KM_1  HC_1       
# ℹ 332 more rows

In this example, we can see that KM_1, KM_2, KM_5 have been matched to HC_1; and KM_3 and KM_4 have been matched to HC_2. Notice that no clusters from the KM set were matched to HC_3; evidently, this is a small cluster that did not manifest clearly in the \(k\)-means clustering.