The Elbow Method is Wrong

This Commonly Used Methodology Rests on Unwarranted Assumptions and Should Never Be Used

Jason Nett
May 5, 2022

Introduction

Unsupervised training is a category of methodologies used to explore the structure of datasets. As opposed to forms of supervised training where a model is trained on a dataset in which each data point is assigned some label (for classification) or correct numerical answer (for regression), unsupervised training methodologies assume no such labels or correct answers. Instead, a different question is being explored: “what structure exists inherently within the dataset?”

A simple and common form of unsupervised training is called “K-means clustering”, where the “K” refers to the number of “clusters” that are assumed to be inherent to a dataset. The K-means clustering methodology attempts to partition the data points of a dataset into 𝐾 discrete clusters, where 𝐾 is some positive integer. As such, K-means clustering is sometimes described as a form of “exclusive unsupervised training” because the dataset is divided into mutually exclusive subsets. K-means clustering does not, however, determine the optimal number of clusters inherent to a dataset. Instead, the number of clusters is chosen arbitrarily up front, and the dataset is then partitioned into that number of categories in an optimized manner. Details of the K-means clustering methodology are not discussed here; we will assume familiarity with it from this point forward. Instead, this essay discusses how to determine the optimal number of clusters inherent to a dataset being explored with K-means clustering.

A common methodology for making this estimate is the “Elbow Method”: chart the sum of squared distances between each data point of a dataset and the centroid to which it is assigned during the K-means clustering algorithm against a range of chosen 𝐾 values, then estimate the optimal number of clusters to be the 𝐾 at which this chart transitions from a steep drop to a gradual asymptotic approach toward zero. The Elbow Method was always intended to be a heuristic rather than a rigorous argument.

Note also that while this essay is motivated by the use of the “Elbow Method” in conjunction with K-means clustering, the arguments presented apply to any form of unsupervised training that assumes the validity of the Elbow Method.

Attempts to automate the “Elbow Method”, however, reveal a fatal — though subtle — flaw. In this essay, we analyze this methodology and argue that the Elbow Method should not be used to estimate the optimal number of clusters inherent to a dataset.

The Elbow Method

Before we address the flaws of the Elbow Method, we implement it for a series of highly idealized 2-dimensional datasets curated to contain a known number of data clusters without extraneous noise. The sklearn Python package contains a function, make_blobs, that generates inherently clustered datasets for development and testing purposes. The number of clusters, the number of data points per cluster, the number of dimensions of the space in which to generate data, and the standard deviation of the distribution of points in each cluster (drawn from a normal distribution) are all configurable. The following script is an example of how sklearn (executed with python3.8) can be used to generate clustered datasets:

import os
import csv

from sklearn.datasets import make_blobs

MIN_K = 1
MAX_K = 20
SAMPLES_PER_CLUSTER = 100
N_DIMENSIONS = 2
CLUSTER_STD = 0.25
OUTPUT_DIR = 'data'
OUTPUT_FILENAME_BASE = 'pseudodata'

for k in range(MIN_K, MAX_K + 1):
    print('Generating k = ', k)
    data_points, labels = make_blobs(
        n_samples = k * SAMPLES_PER_CLUSTER,
        n_features = N_DIMENSIONS,
        centers = k,
        cluster_std = CLUSTER_STD,
        shuffle = True,
        random_state = 0,
    )

    # Write the pseudodata to file
    output_filename = os.path.join(
        OUTPUT_DIR,
        OUTPUT_FILENAME_BASE +
        '_clusters_' + str(k) +
        '_std_' + str(CLUSTER_STD) +
        '_samplesPerCluster_' + str(SAMPLES_PER_CLUSTER) +
        '.csv'
    )
    with open(output_filename, mode = 'w') as output_file:
        csv_writer = csv.writer(output_file, delimiter = ',', quoting = csv.QUOTE_MINIMAL)
        for point in data_points:
            csv_writer.writerow(point)

With this script we generate pseudo-datasets containing three, six, and nine clusters, then demonstrate use of the Elbow Method on each.

Three-Cluster Pseudo-Dataset

(Image created by author) Three-cluster pseudo-dataset generated for development purposes. This dataset contains 2-dimensional data, 100 data points per cluster, and data points are drawn from a Normal Distribution with a standard deviation of 0.25.

Six-Cluster Pseudo-Dataset

(Image created by author) Six-cluster pseudo-dataset generated for development purposes. This dataset contains 2-dimensional data, 100 data points per cluster, and data points are drawn from a Normal Distribution with a standard deviation of 0.25.

Nine-Cluster Pseudo-Dataset

(Image created by author) Nine-cluster pseudo-dataset generated for development purposes. This dataset contains 2-dimensional data, 100 data points per cluster, and data points are drawn from a Normal Distribution with a standard deviation of 0.25.

Each of these plots illustrates the inherently clustered nature of its pseudo-dataset, and the red points are the centroids from an execution of K-means clustering on that pseudo-dataset when the correct number of centroids 𝐾 is chosen. In practice, we do not know a priori the optimal number of clusters to assume for a dataset, but the K-means clustering algorithm requires that number to be specified up front.

We intentionally use pseudo-datasets that are clearly and tightly clustered for pedagogical and illustrative purposes. The correct answer to the question of how many clusters are inherent to each dataset is unambiguous, so a reliable methodology should be able to reproduce the answer built into each dataset.

To estimate the number of clusters inherent to a dataset with the “Elbow Method”, chart the sum of squared distances between each data point and the centroid to which it is assigned against the chosen number of centroids 𝐾 with which K-means clustering is executed. Intuitively, if the value of 𝐾 is less than the correct number of clusters 𝐾ₜᵣᵤₑ, then the sum of squared distances drops quickly as 𝐾 approaches 𝐾ₜᵣᵤₑ because centroids tend to inhabit the spaces between clusters when too few are chosen. For example, if we execute the K-means clustering algorithm with 𝐾 = 3 for the pseudo-dataset with six clusters (𝐾ₜᵣᵤₑ = 6) above, then we see that two of the three centroids inhabit the space between clusters in this particular execution:

(Image created by author) When too few centroids are chosen during K-means clustering for a clustered dataset, some centroids will tend to settle between clusters.

When 𝐾 is greater than 𝐾ₜᵣᵤₑ and increasing, however, the centroids start to share clusters and so the sum of squared distances decreases slowly because the centroids are already contained within clusters. In the limit that the number of centroids approaches the number of data points in the dataset, every single data point can be assigned its own centroid, in principle, at its own coordinate. Hence, the sum of squared distances drops to zero in the limit of large 𝐾.

For another example, if we execute the K-means clustering algorithm with 𝐾 = 9 for the same six cluster pseudo-dataset above, then we see that two of the clusters have multiple centroids sharing data points in this particular execution.

(Image created by author) When too many centroids are chosen during K-means clustering for the clustered dataset, some of the clusters tend to be shared by multiple centroids.

Therefore, the Elbow Method is to chart the sum of squared distances for an execution of the K-means clustering algorithm against the assumed value of 𝐾, then see where the value stops dropping quickly and starts dropping slowly toward zero — the transition between those two behaviors. The “method” is to find the 𝐾 at which the “elbow” appears; that is, the point of highest curvature.
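
For concreteness, here is a minimal sketch of how such a chart can be produced. This is not the author’s original plotting code; it assumes the CSV files written by the generation script above and uses sklearn’s KMeans, whose inertia_ attribute is exactly the sum of squared distances described here.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load one of the pseudo-datasets written by the generation script above
# (here the file with three clusters, std 0.25, 100 samples per cluster)
data = np.loadtxt(
    'data/pseudodata_clusters_3_std_0.25_samplesPerCluster_100.csv',
    delimiter = ',',
)

k_values = list(range(1, 21))
sum_squared_distances = []
for k in k_values:
    kmeans = KMeans(n_clusters = k, random_state = 0).fit(data)
    # inertia_ is the sum of squared distances from each point to its assigned centroid
    sum_squared_distances.append(kmeans.inertia_)

plt.plot(k_values, sum_squared_distances, 'o')
plt.xlabel('K')
plt.ylabel('Sum of squared distances')
plt.show()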

For the pseudo-dataset with three clusters, that chart looks like

(Image created by author) Elbow Method for a three-cluster pseudo-dataset. Chart the sum of squared distances from each data point to the centroid to which it is assigned against K, then visually identify the K at which there is a transition (the “elbow”).

and in this highly idealized scenario the method appears to produce an obviously correct answer — the sum of squared distances drops very rapidly until 𝐾 = 3 and then continues to drop at a markedly slower rate. In other words, the points appear to fall along roughly straight lines from 𝐾 = 1 to 𝐾 = 3 and from 𝐾 = 3 onward, with the “elbow” (or point of highest curvature) at 𝐾 = 3.

A similar pattern can be seen in the sum of squared distances charted against 𝐾 for the pseudo-dataset with six clusters

(Image created by author) Elbow Method for a six-cluster pseudo-dataset. Chart the sum of squared distances from each data point to the centroid to which it is assigned against K, then visually identify the K at which there is a transition (the “elbow”).

and again in the same chart for the pseudo-dataset with nine clusters

(Image created by author) Elbow Method for a nine-cluster pseudo-dataset. Chart the sum of squared distances from each data point to the centroid to which it is assigned against K, then visually identify the K at which there is a transition (the “elbow”).

where the visual inspection starts to become more ambiguous.

The Elbow Method is typically implemented by visually inspecting a chart of this kind and then executing K-means clustering again on the dataset being analyzed with the number of centroids 𝐾 chosen from that inspection. At least superficially, this method does appear to provide an intuitively motivated and reasonable estimate of the correct number of clusters inherent to a dataset. Visual inspection of a chart and the manual, arbitrary selection of a value along the horizontal axis is not, however, a rigorous mathematical argument for why such a method should reliably identify the number of clusters inherent to a dataset. Anecdotally observing that it seems to do so in exercises like the above is not sufficient.

Of course, that also raises the question: what if you are faced with a situation where you need or want to automate the process?

Automating the Elbow Method

Automating the selection of the 𝐾 value that best estimates 𝐾ₜᵣᵤₑ requires translating the visual-inspection argument from the previous section into a more precise mathematical algorithm. Before we can define a decision-making process based on the sum of squared distances vs. 𝐾, we also need a model of its behavior.

The previous section illustrated the Elbow Method’s argument that the chart of the sum of squared distances against the chosen 𝐾 value should exhibit different behaviors on either side of the 𝐾 value corresponding to the correct number of clusters in the dataset. To the left of 𝐾 = 𝐾ₜᵣᵤₑ, the sum of squared distances appears to blow up as the vertical axis is approached. Indeed, in the limit of 𝐾 = 1 and the standard deviation of the clusters approaching infinity, the sum of squared distances necessarily also goes to infinity. To the right of 𝐾 = 𝐾ₜᵣᵤₑ, as described in the previous section, as 𝐾 increases toward the number of data points in the dataset, every data point can potentially be assigned its own centroid at its own coordinate. Thus, as 𝐾 grows large, the sum of squared distances can be approximately modeled as asymptotically approaching zero. Acknowledging that 𝐾, and therefore also the sum of squared distances being discussed, is only defined for the integers 𝐾 ≥ 1, these behaviors can be approximated by a generic hyperbola

𝑦 = 𝑎 / 𝑥ᵇ

where 𝑦 models the sum of squared distances, 𝑥 approximates the discrete 𝐾 values, and 𝑎 and 𝑏 are floating parameters determined by fitting the curve to the charted points. We are proceeding on the assumption that a hyperbola can sufficiently approximate the charted points for the sum of squared distances against 𝐾, so the “elbow” of the fitted hyperbola approximates that of the points in the corresponding chart.
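
One way to perform such a fit is sketched below. The fitting procedure is not specified here, so this is only an illustration: it assumes the hyperbola form 𝑦 = 𝑎/𝑥ᵇ given above and uses scipy’s curve_fit on the 𝐾 values and sum of squared distances computed in the earlier sketch.

import numpy as np
from scipy.optimize import curve_fit

def hyperbola(x, a, b):
    # Generic hyperbola: y = a / x**b
    return a / np.power(x, b)

# k_values and sum_squared_distances were computed in the earlier Elbow Method sketch
k_arr = np.asarray(k_values, dtype = float)
ssd_arr = np.asarray(sum_squared_distances, dtype = float)

(a_fit, b_fit), _ = curve_fit(hyperbola, k_arr, ssd_arr, p0 = (1000.0, 1.0))
print('FINAL PARAMETER FITS')
print('a:', round(a_fit, 1))
print('b:', round(b_fit, 2))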

Let’s pause and address the imprecision of talking about the “elbow” of both the sum of squared distances chart and a hyperbola. In both, the direction of the curve changes fastest at some finite horizontal axis value; to the left, the curve effectively straightens as it approaches the vertical axis (the curvature approaches zero as the slope becomes nearly vertical), and to the right it flattens as it approaches the horizontal axis. Between those two extremes, the curvature of the curve achieves some maximum, and that is what is colloquially called the “elbow”. Therefore, the “elbow method” is to find the horizontal axis value at which the curvature of a hyperbola fitted to the sum of squared distances chart is at a maximum. The curvature of a curve is given by a well-known formula from calculus

(Image created by author) Curvature formula for a single-variable function

Let’s compute the curvature formula for the same generic hyperbola form that we are fitting to the sum of squared distances chart. The first and second derivatives that we need are

(Image created by author) First and second derivatives of a generic hyperbola

Then we can use these derivatives to find the curvature formula for a general hyperbola

(Image created by author) Curvature formula for a generic hyperbola

Let’s observe this behavior for the case of 𝑎 = 𝑏 = 1 before proceeding to the fitted curves.

(Image created by author) Curvature formula for a hyperbola when both parameters are equal to 1.0

Notice how the curvature formula does appear to hit a maximum precisely where the hyperbola formula appears to be most curved

(Image created by author) Hyperbola and its curvature when a = 1.0 and b = 1.0. Notice how the curvature curve does peak where the slope of the hyperbola is changing the fastest and drops to zero in both directions where the hyperbola flattens. Hence, the maximum of the curvature indicates where the “elbow” of the hyperbola is.

Let’s proceed to fit the general hyperbola formula to each of the 𝐾ₜᵣᵤₑ = 3, 𝐾ₜᵣᵤₑ = 6, and 𝐾ₜᵣᵤₑ = 9 sum of squared distance charts above.

Fitted to 𝐾ₜᵣᵤₑ = 3:

FINAL PARAMETER FITS
a: 1516.4
b: 3.14
(Image created by author) Hyperbola fitted to the sum of squared distances vs. K chart for a dataset with three clusters

Fitted to 𝐾ₜᵣᵤₑ = 6:

FINAL PARAMETER FITS
a: 15252.5
b: 1.86
(Image created by author) Hyperbola fitted to the sum of squared distances vs. K chart for a dataset with six clusters

Fitted to 𝐾ₜᵣᵤₑ = 9:

FINAL PARAMETER FITS
a: 59344.4
b: 1.85
(Image created by author) Hyperbola fitted to the sum of squared distances vs. K chart for a dataset with nine clusters

With these fits, we can see that the subjective visual inspection of a chart for the maximum curvature begins to break down as 𝐾ₜᵣᵤₑ increases. Whether using the sum of squared distances points themselves or the fitted hyperbola curve, the Elbow Method seems reasonable enough for 𝐾ₜᵣᵤₑ = 3, but becomes increasingly ambiguous even for subjective visual inspection in the 𝐾ₜᵣᵤₑ = 6 and 𝐾ₜᵣᵤₑ = 9 cases. The true problem, however, is far more severe.

Returning to the curvature computation, let’s plot the fitted hyperbola for the 𝐾ₜᵣᵤₑ = 3 case (with 𝑎 = 1516.4 and 𝑏 = 3.14) against the curvature for the same expression

(Image created by author) Charted here are the hyperbola fitted to the sum of squared distances vs. K chart for the three-cluster pseudo-dataset and the curvature formula for that hyperbola. Note how the curvature maximum is closer to K = 8, not K = 3. Note that the curvature curve has been scaled up by a factor of 10 for visibility.

When we formally compute the curvature for the 𝐾ₜᵣᵤₑ = 3 case, we see that it actually hits a maximum at approximately 𝑥 ≈ 8. (Note that the curvature curve has been scaled up by a factor of 10 here to make it visible when charted against the hyperbola it describes.)

Next, for the 𝐾ₜᵣᵤₑ = 6 case (with 𝑎 = 15252.5 and 𝑏 = 1.86) we find

(Image created by author) Charted here are the hyperbola fitted to the sum of squared distances vs. K chart for the six-cluster pseudo-dataset and the curvature formula for that hyperbola. Note how the curvature maximum is closer to K = 37, not K = 6. Note that the curvature curve has been scaled up by a factor of 1000 for visibility.

This case is even more pronounced. Instead of the maximum curvature occurring at or near a horizontal axis value of 6, as would be the case if the Elbow Method were reliable, the maximum curvature occurs near 𝑥 ≈ 37. (Note that the curvature curve has been scaled up by a factor of 1000 here to make it visible when charted against the hyperbola it describes.)

Finally, let’s observe the 𝐾ₜᵣᵤₑ = 9 case (with 𝑎 = 59344.4 and 𝑏 = 1.85) in which we find

(Image created by author) Charted here are the hyperbola fitted to the sum of squared distances vs. K chart for the nine-cluster pseudo-dataset and the curvature formula for that hyperbola. Note how the curvature maximum is closer to K = 60, not K = 9. Note that the curvature curve has been scaled up by a factor of 1000 for visibility.

Still more, the true “elbow”, i.e. the horizontal axis value corresponding to the maximum curvature of the curve fitted to the sum of squared distances chart, is closer to 60, not 9. A subjective visual inspection of the hyperbola fitted to the sum of squared distances does appear to put its “elbow” in line with the 60 value as well, in contrast to our original chart of both the sum of squared distances and the hyperbola fitted to it. (Note that the curvature curve has been scaled up by a factor of 1000 here to make it visible when charted against the hyperbola it describes.)
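
These curvature maxima can be checked numerically. The sketch below assumes the hyperbola form 𝑦 = 𝑎/𝑥ᵇ and the fitted parameter values quoted above; it should locate maxima near the 𝐾 ≈ 8, 37, and 60 values discussed here rather than at 3, 6, and 9.

import numpy as np

def curvature_argmax(a, b, x_max = 200.0):
    # Numerically locate the x at which the curvature of y = a / x**b is largest
    x = np.linspace(1.0, x_max, 200000)
    y1 = -a * b * np.power(x, -b - 1.0)              # first derivative
    y2 = a * b * (b + 1.0) * np.power(x, -b - 2.0)   # second derivative
    curvature = np.abs(y2) / np.power(1.0 + y1 ** 2, 1.5)
    return x[np.argmax(curvature)]

# Fitted (a, b) pairs for the three-, six-, and nine-cluster pseudo-datasets
for k_true, a, b in [(3, 1516.4, 3.14), (6, 15252.5, 1.86), (9, 59344.4, 1.85)]:
    print('K_true =', k_true, '-> curvature peaks near K =', round(curvature_argmax(a, b), 1))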

Thus, our attempt to automate the Elbow Method has produced answers that are wildly incorrect.

What happened?

The Flaw in the Elbow Method

Reasonably intuitive arguments can go awry when they rely on subjective judgment and when we do not rigorously acknowledge or understand our assumptions. In this case, the Elbow Method originally assumed that the horizontal axis value corresponding to the charted maximum curvature (i.e. the “elbow”) could be ascertained by a subjective visual inspection as it was charted. Nevertheless, our eyes deceived us.

Let’s look again at the same 𝐾ₜᵣᵤₑ = 3 sum of squared distances chart and make note of the relative scales of the horizontal and vertical axes.

(Image created by author) Sum of squared distances between each data point and its assigned centroid charted against K. Notice the wildly different scaling between the vertical and horizontal axes. This warps the visual perception of where the “elbow” is.

The horizontal axis spans 0 to 30 while the vertical axis spans 0 to 1600. A visual inspection for the location of the “elbow” fails because the visual representation of the data itself is extremely distorted. Further, as shown in the previous section, when computed rigorously the location of the “elbow” is nowhere near the true number of clusters in the underlying dataset. The location of the “elbow” as judged by visual inspection changes with the relative scaling of the vertical and horizontal axes.

Fundamentally, even in cases like 𝐾ₜᵣᵤₑ = 3 illustrated here, where the Elbow Method appears to yield the correct answer, it does so only under the implicit assumption that the vertical axis should be rescaled so that all of the sum of squared distance values are visible and span the full vertical extent of the chart. We know of no rigorous motivation to support this assumption.
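
To make the dependence on axis scaling concrete, the sketch below is an illustration of the argument rather than part of the original analysis. The hypothetical factor s rescales the vertical axis relative to the horizontal one before the curvature is computed; with s = 1 the maximum curvature of the fitted three-cluster hyperbola sits near 𝐾 ≈ 8, but as s shrinks toward the chart’s actual aspect (roughly 30/1600), the apparent “elbow” migrates toward small 𝐾, in the neighborhood of the 𝐾 = 3 that visual inspection suggests.

import numpy as np

def visual_elbow(a, b, s, x_max = 200.0):
    # Locate the maximum-curvature point of the rescaled curve s * a / x**b,
    # where s models the relative scaling of the vertical and horizontal axes
    x = np.linspace(1.0, x_max, 200000)
    y1 = -s * a * b * np.power(x, -b - 1.0)
    y2 = s * a * b * (b + 1.0) * np.power(x, -b - 2.0)
    curvature = np.abs(y2) / np.power(1.0 + y1 ** 2, 1.5)
    return x[np.argmax(curvature)]

# Fitted hyperbola parameters for the three-cluster pseudo-dataset
a, b = 1516.4, 3.14
for s in [1.0, 0.1, 30.0 / 1600.0]:
    print('s =', round(s, 5), '-> apparent elbow near K =', round(visual_elbow(a, b, s), 1))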

Thus, the Elbow Method fails rigorous scrutiny and should not be trusted even as a heuristic.

Of course, if there’s a line of argumentation that we’ve missed, we’ll gladly update accordingly.

Conclusion

We have argued that intuitive methodologies that appear to produce correct answers should not be trusted in the absence of rigorous arguments for their veracity. The ubiquitous Elbow Method for determining the optimal number of clusters for a dataset is fundamentally flawed and should not be used. In lieu of this method, we advocate the use of another method that we designed from scratch to address and replace the failed Elbow Method. We call it the “Factionalization Method”, and it is proposed here.
