The Elbow Method is Wrong
This Commonly Used Methodology Rests on Unwarranted Assumptions and Should Never Be Used
Introduction
Unsupervised training is a category of methodologies used to explore the structure of datasets. As opposed to forms of supervised training where a model is trained on a dataset in which each data point is assigned some label (for classification) or correct numerical answer (for regression), unsupervised training methodologies assume no such labels or correct answers. Instead, a different question is being explored: “what structure exists inherently within the dataset?”
A simple and common form of unsupervised training is called “K-means clustering”, where the “K” refers to the number of “clusters” that are assumed to be inherent to a dataset. The K-means clustering methodology attempts to partition the data points of a dataset into 𝐾 discrete clusters, where 𝐾 is some positive integer. As such, K-means clustering is sometimes described as a form of “exclusive unsupervised training” because the dataset is divided into mutually exclusive subsets. K-means clustering does not, however, determine the optimal number of clusters inherent to a dataset. Instead, the number of clusters is chosen arbitrarily up front, and the dataset is then partitioned into that number of categories in an optimized manner. Details of the K-means clustering methodology are not discussed here, and we assume familiarity with it from this point forward. Instead, this essay discusses how to determine the optimal number of clusters inherent to a dataset being explored with K-means clustering.
A common methodology for making this estimation is called the “Elbow Method”: chart the sum of squared distances between every data point of a dataset and the centroid to which it is assigned against a range of chosen 𝐾 values, then estimate the optimal number of clusters to be where this chart transitions from a steep drop to a gradual asymptotic approach to zero. The Elbow Method was always intended to be a heuristic rather than a rigorous argument.
Note also that, while the topics discussed in this essay are motivated by the use of the “Elbow Method” in conjunction with K-means clustering, the arguments presented are applicable to any form of unsupervised training that assumes the validity of the Elbow Method.
Attempts to automate the “Elbow Method”, however, reveal a subtle but fatal flaw. In this essay, we analyze this methodology and argue that the Elbow Method should not be used to estimate the optimal number of clusters inherent to a dataset.
The Elbow Method
Before we address the flaws of the Elbow Method, we implement it for a series of highly idealized 2-dimensional datasets curated to contain a known number of data clusters without extraneous noise. The sklearn python package contains a method, make_blobs, that generates inherently clustered datasets for development and testing purposes. The number of clusters, the number of data points per cluster, the number of dimensions of the space in which to generate the data, and the standard deviation of the distribution of points in each cluster (drawn from a normal distribution) are all configurable. The following script is an example of how sklearn (executed with python3.8) can be used to generate clustered datasets:
import os
import csv

from sklearn.datasets import make_blobs

MIN_K = 1
MAX_K = 20
SAMPLES_PER_CLUSTER = 100
N_DIMENSIONS = 2
CLUSTER_STD = 0.25
OUTPUT_DIR = 'data'
OUTPUT_FILENAME_BASE = 'pseudodata'

# Ensure the output directory exists before writing to it
os.makedirs(OUTPUT_DIR, exist_ok = True)

for k in range(MIN_K, MAX_K + 1):
    print('Generating k = ', k)
    data_points, labels = make_blobs(
        n_samples = k * SAMPLES_PER_CLUSTER,
        n_features = N_DIMENSIONS,
        centers = k,
        cluster_std = CLUSTER_STD,
        shuffle = True,
        random_state = 0,
    )

    # Write the pseudodata to file
    output_filename = os.path.join(
        OUTPUT_DIR,
        OUTPUT_FILENAME_BASE +
        '_clusters_' + str(k) +
        '_std_' + str(CLUSTER_STD) +
        '_samplesPerCluster_' + str(SAMPLES_PER_CLUSTER) +
        '.csv'
    )
    with open(output_filename, mode = 'w', newline = '') as output_file:
        csv_writer = csv.writer(output_file, delimiter = ',', quoting = csv.QUOTE_MINIMAL)
        for point in data_points:
            csv_writer.writerow(point)
With this script we generate pseudo-datasets with three, six, and nine clusters, then demonstrate use of the Elbow Method.
Three-Cluster Pseudo-Dataset

Six-Cluster Pseudo-Dataset

Nine-Cluster Pseudo-Dataset

Each plotted pseudo-dataset illustrates its inherently clustered nature; the red points are the centroids from an execution of K-means clustering on each pseudo-dataset when the correct number of centroids 𝐾 is chosen. In practice, we do not know a priori the optimal number of clusters to assume for a dataset, but the K-means clustering algorithm requires that number to be specified up front.
We intentionally use pseudo-datasets that are clearly and tightly clustered for pedagogical and illustrative purposes. The correct answer to the question of how many clusters are inherent to each dataset is unambiguous, so a reliable methodology for determining that answer should be able to reproduce the correct answer built into the datasets.
To estimate the number of clusters inherent to a dataset with the “Elbow Method”, chart the sum of squared distances between each data point and the centroid to which it is assigned against the number of centroids 𝐾 chosen for the execution of K-means clustering. Intuitively, if the value of 𝐾 is less than the correct number of clusters 𝐾ₜᵣᵤₑ, then the sum of squared distances drops quickly as 𝐾 approaches 𝐾ₜᵣᵤₑ, because centroids tend to inhabit the spaces between clusters when too few are chosen. For example, if we execute the K-means clustering algorithm with 𝐾 = 3 for the pseudo-dataset with six clusters (𝐾ₜᵣᵤₑ = 6) above, then we see that two of the three centroids inhabit the space between clusters in this particular execution

When 𝐾 is greater than 𝐾ₜᵣᵤₑ and increasing, however, centroids start to share clusters, and the sum of squared distances decreases slowly because the centroids are already contained within clusters. In the limit where the number of centroids approaches the number of data points in the dataset, every single data point can, in principle, be assigned its own centroid at its own coordinate. Hence, the sum of squared distances drops to zero in the limit of large 𝐾.
For another example, if we execute the K-means clustering algorithm with 𝐾 = 9 for the same six-cluster pseudo-dataset above, then we see that two of the clusters have multiple centroids sharing data points in this particular execution.

Therefore, the Elbow Method is to chart the sum of squared distances for an execution of the K-means clustering algorithm against the value of 𝐾 assumed, then see where the value stops dropping quickly and starts dropping slowly to zero, i.e. the transition between those two behaviors. The “method” is to find the 𝐾 at which the “elbow” appears; that is, the point of highest curvature.
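Concretely, the values for such a chart can be computed with sklearn, whose KMeans implementation exposes the sum of squared distances as its inertia_ attribute after fitting. The following is a minimal sketch; the helper name is ours, and it assumes a pseudo-dataset has been loaded into an array data_points:

from sklearn.cluster import KMeans

def sums_of_squared_distances(data_points, k_values):
    # `inertia_` is sklearn's name for the sum of squared distances
    # between each data point and the centroid to which it is assigned
    return [
        KMeans(n_clusters = k, n_init = 10, random_state = 0).fit(data_points).inertia_
        for k in k_values
    ]

# e.g. chart_values = sums_of_squared_distances(data_points, range(1, 21))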
For the pseudo-dataset with three clusters, that chart looks like

and in this highly idealized scenario the method appears to produce an obviously correct answer — the sum of squared distances drops very rapidly until 𝐾 = 3 and then continues to drop at a markedly slower rate. In other words, the points appear to be fairly straight from 𝐾 = 1 to 𝐾 = 3 and from 𝐾 = 3 onward, with the “elbow” (or point of highest curvature) at 𝐾 = 3.
A similar pattern can be seen in the sum of squared distances charted against 𝐾 for the pseudo-dataset with six clusters

and again in the same chart for the pseudo-dataset with nine clusters

where the visual inspection starts to become more ambiguous.
The Elbow Method is typically implemented by a visual inspection of a chart of this kind, followed by manually executing K-means clustering (again) on the dataset being analyzed with the chosen number of centroids 𝐾. At least superficially, this method does appear to provide a good estimate of the correct number of clusters inherent to a dataset, with an intuitive motivation. Visual inspection of a chart and the manual, arbitrary selection of a value along the horizontal axis is not, however, a rigorous mathematical argument for why such a method should reliably identify the number of clusters inherent to a dataset. Anecdotally observing that it seems to do so in exercises like the above is not sufficient.
Of course, that also raises the question: what if you are faced with a situation where you need or want to automate the process?
Automating the Elbow Method
Automating the selection of the 𝐾 value that best estimates 𝐾ₜᵣᵤₑ requires translating the visual inspection argument from the previous section into a more precise mathematical algorithm. Before we can define a decision-making process based on the sum of squared distances vs. 𝐾, we need a model of its behavior.
The previous section illustrated the Elbow Method’s argument that the chart of the sum of squared distances against the chosen 𝐾 value should exhibit different behaviors on either side of the 𝐾 value corresponding to the correct number of clusters in the dataset. To the left of 𝐾 = 𝐾ₜᵣᵤₑ, the sum of squared distances blows up as the vertical axis is approached; indeed, for 𝐾 = 1, as the standard deviation of the clusters approaches infinity, the sum of squared distances necessarily also goes to infinity. To the right of 𝐾 = 𝐾ₜᵣᵤₑ, as described in the previous section, as 𝐾 increases towards (and beyond) the number of data points in the dataset, every data point can potentially be assigned its own centroid at the same coordinate. Thus, as 𝐾 approaches infinity, the sum of squared distances can be approximately modeled as asymptotically approaching zero. Acknowledging that 𝐾, and therefore also the sum of squared distances value being discussed, is only defined for the integers 𝐾 ≥ 1, these behaviors can be approximated by a generic hyperbola

$$y = \frac{a}{x^{b}}$$
where 𝑦 models the sum of squared distances, 𝑥 approximates the discrete 𝐾 values, and 𝑎 and 𝑏 are free parameters determined by a machine learning fit. We proceed on the assumption that a hyperbola can sufficiently approximate the charted points of the sum of squared distances against 𝐾, so that the “elbow” of the fitted hyperbola approximates that of the points in the corresponding chart.
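For concreteness, such a fit can be performed with scipy.optimize.curve_fit. This is an illustrative sketch of one plausible fitting procedure, not necessarily the one used to produce the parameter values reported below, and the exact fitted values will vary with the data and fit configuration:

import numpy as np
from scipy.optimize import curve_fit
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def hyperbola(x, a, b):
    # The generic hyperbola y = a / x**b used to model the chart
    return a / x**b

# Regenerate the three-cluster pseudo-dataset and its chart values
data_points, _ = make_blobs(n_samples = 300, n_features = 2, centers = 3, cluster_std = 0.25, random_state = 0)
k_values = np.arange(1, 21, dtype = float)
sums_of_squares = [KMeans(n_clusters = int(k), n_init = 10, random_state = 0).fit(data_points).inertia_ for k in k_values]

(a_fit, b_fit), _ = curve_fit(hyperbola, k_values, sums_of_squares, p0 = (1000.0, 1.0))
print('a:', round(a_fit, 1), 'b:', round(b_fit, 2))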
Let’s pause and address the imprecision of talking about the “elbow” in both the sum of squared distances chart and hyperbolas. In both, the curve straightens out toward a vertical line as it approaches the vertical axis on the left (the curvature approaches zero as the slope approaches vertical) and flattens toward a horizontal line as it approaches the horizontal axis on the right. Between those two extremes, the curvature of the curve achieves some maximum, and that maximum is what is colloquially denoted the “elbow”. The “elbow method”, therefore, is to find the horizontal axis value at which the curvature of a hyperbola fitted to the sum of squared distances chart is at a maximum. The curvature of a curve is given by a well-known formula from calculus

$$\kappa = \frac{\lvert y'' \rvert}{\left(1 + (y')^{2}\right)^{3/2}}$$
Let’s compute the curvature for the same generic hyperbola form that we are fitting to the sum of squared distances chart. The first and second derivatives that we need are

$$y' = -\frac{ab}{x^{b+1}}, \qquad y'' = \frac{ab(b+1)}{x^{b+2}}$$
Then we can use these derivatives to find the curvature formula for a general hyperbola

$$\kappa(x) = \frac{ab(b+1)\,x^{-(b+2)}}{\left(1 + a^{2}b^{2}\,x^{-2(b+1)}\right)^{3/2}}$$
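These derivatives and the resulting curvature expression are straightforward to verify symbolically; a minimal sympy check (assuming positive a, b, and x):

import sympy as sp

x, a, b = sp.symbols('x a b', positive = True)
y = a / x**b

# Curvature: |y''| / (1 + y'^2)^(3/2)
kappa = sp.Abs(sp.diff(y, x, 2)) / (1 + sp.diff(y, x)**2)**sp.Rational(3, 2)
print(sp.simplify(kappa))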
Let’s observe this behavior for the case of 𝑎 = 𝑏 = 1 before proceeding to the fitted curves.

Notice how the curvature formula does appear to hit a maximum precisely where the hyperbola formula appears to be most curved

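A quick numerical check backs up the visual impression: evaluated on a dense grid, the curvature of y = 1/x (the a = b = 1 case) peaks at x = 1, where the curve is most visibly bent.

import numpy as np

# Curvature of y = 1/x, from the closed form above with a = b = 1
x = np.linspace(0.1, 10.0, 100_000)
kappa = 2.0 * x**-3 / (1.0 + x**-4)**1.5
print('maximum curvature at x ~', round(float(x[np.argmax(kappa)]), 3))  # ~1.0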
Let’s proceed to fit the general hyperbola formula to each of the 𝐾ₜᵣᵤₑ = 3, 𝐾ₜᵣᵤₑ = 6, and 𝐾ₜᵣᵤₑ = 9 sum of squared distance charts above.
Fitted to 𝐾ₜᵣᵤₑ = 3:
FINAL PARAMETER FITS
a: 1516.4
b: 3.14

Fitted to 𝐾ₜᵣᵤₑ = 6:
FINAL PARAMETER FITS
a: 15252.5
b: 1.86

Fitted to 𝐾ₜᵣᵤₑ = 9:
FINAL PARAMETER FITS
a: 59344.4
b: 1.85

With these fits, we can see that subjective visual inspection of a chart for the maximum curvature becomes less reliable as 𝐾ₜᵣᵤₑ increases. Whether using the sum of squared distances points themselves or the fitted hyperbola curve, the Elbow Method seems reasonable enough for 𝐾ₜᵣᵤₑ = 3, but grows increasingly ambiguous even for subjective visual inspections of the 𝐾ₜᵣᵤₑ = 6 and 𝐾ₜᵣᵤₑ = 9 cases. The true problem, however, is far more severe.
Returning to the curvature computation, let’s plot the fitted hyperbola for the 𝐾ₜᵣᵤₑ = 3 case (with 𝑎 = 1516.4 and 𝑏 = 3.14) against the curvature for the same expression

When we formally compute the curvature for the 𝐾ₜᵣᵤₑ = 3 case, we see that it actually hits a maximum at approximately 𝑥 ~ 8. (Note that the curvature curve has been scaled up by a factor of 10 here to make it visible when charted against the hyperbola it describes.)
Next, for the 𝐾ₜᵣᵤₑ = 6 case (with 𝑎 = 15252.5 and 𝑏 = 1.86) we find

The discrepancy here is even more pronounced. Instead of the maximum curvature occurring at or near a horizontal axis value of 6, as would be the case if the Elbow Method were reliable, the maximum curvature occurs near 𝑥 ~ 37. (Note that the curvature curve has been scaled up by a factor of 1000 here to make it visible when charted against the hyperbola it describes.)
Finally, let’s observe the 𝐾ₜᵣᵤₑ = 9 case (with 𝑎 = 59344.4 and 𝑏 = 1.85) in which we find

Still more striking, the true “elbow”, i.e. the horizontal axis value corresponding to the maximum curvature of the curve fitted to the sum of squared distances chart, is closer to 60, not 9. A subjective visual inspection of the hyperbola fitted to the sum of squared distances at this scale does appear to put its “elbow” in line with the 60 value as well, in contrast to our original chart of both the sum of squared distances and the hyperbola fitted to it. (Note that the curvature curve has been scaled up by a factor of 1000 here to make it visible when charted against the hyperbola it describes.)
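For reference, these curvature maxima can be located numerically from the fitted parameters. The following sketch (our own helper, applying scipy’s bounded scalar minimizer to the negated closed-form curvature) reproduces the locations discussed above:

from scipy.optimize import minimize_scalar

def hyperbola_curvature(x, a, b):
    # Closed-form curvature of y = a / x**b derived above
    y1 = -a * b * x**(-(b + 1.0))
    y2 = a * b * (b + 1.0) * x**(-(b + 2.0))
    return y2 / (1.0 + y1**2)**1.5

# (a, b) fits reported above for each pseudo-dataset
for k_true, (a, b) in {3: (1516.4, 3.14), 6: (15252.5, 1.86), 9: (59344.4, 1.85)}.items():
    result = minimize_scalar(lambda x: -hyperbola_curvature(x, a, b), bounds = (1.0, 200.0), method = 'bounded')
    print('K_true =', k_true, '-> maximum curvature at x ~', round(result.x, 1))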
Thus, our attempt to automate the Elbow Method has produced answers that are wildly incorrect.
What happened?
The Flaw in the Elbow Method
Reasonably intuitive arguments can go awry when they rest on subjective judgment and on assumptions we do not rigorously acknowledge or understand. In this case, the Elbow Method assumed that the horizontal axis value corresponding to the charted maximum curvature (i.e. the “elbow”) could be ascertained by a subjective visual inspection of the chart as drawn. Our eyes deceived us.
Let’s look again at the same 𝐾ₜᵣᵤₑ = 3 sum of squared distances chart and make note of the relative scales of the horizontal and vertical axes.

The horizontal axis is scaled from 0 to 30 while the vertical axis scale is zoomed out to 0 to 1600. A visual inspection for the location of the “elbow” fails because the visual representation of the data itself is extremely distorted. Further, as shown in the previous section, the location of the “elbow” is nowhere near the true number of clusters in the underlying dataset when computed rigorously. The location of the “elbow” as judged by visual inspection changes with the relative scaling of the vertical and horizontal axes.
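This distortion can be made precise. Viewing the chart with a rescaled vertical axis can be modeled by replacing the fitted hyperbola y = a/x**b with y = (s·a)/x**b for a scale factor s, and the location of maximum curvature then moves with s. A small sketch for the 𝐾ₜᵣᵤₑ = 3 fit, reusing the closed-form curvature from above:

from scipy.optimize import minimize_scalar

def elbow_location(a, b):
    # x at which the curvature of y = a / x**b is maximal
    def neg_kappa(x):
        y1 = -a * b * x**(-(b + 1.0))
        y2 = a * b * (b + 1.0) * x**(-(b + 2.0))
        return -y2 / (1.0 + y1**2)**1.5
    return minimize_scalar(neg_kappa, bounds = (0.1, 500.0), method = 'bounded').x

a, b = 1516.4, 3.14  # the K_true = 3 fit from above
for s in (1.0, 0.1, 0.01, 0.001):
    print('s =', s, '-> elbow at x ~', round(elbow_location(s * a, b), 1))

With these parameters, the computed “elbow” slides from roughly 𝑥 ~ 8 at s = 1 down toward 𝑥 ~ 1.5 as s shrinks: the apparent elbow is a property of how the chart is drawn, not of the data.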
Fundamentally, even in cases like 𝐾ₜᵣᵤₑ = 3 illustrated here where the Elbow Method appears to yield the correct answer, it does so only under the implicit assumption that the vertical axis should be rescaled so that all of the sum of squared distance values are visible and span the full vertical extent of the chart. We know of no rigorous motivation to support this assumption.
Thus, the Elbow Method fails rigorous scrutiny and should not be trusted even as a heuristic.
Of course, if there’s a line of argumentation that we’ve missed, we’ll gladly update accordingly.
Conclusion
We have argued that intuitive methodologies that appear to produce correct answers should not be trusted in the absence of rigorous arguments for their veracity. The ubiquitous Elbow Method for determining the optimal number of clusters inherent to a dataset is fundamentally flawed and should not be used. In lieu of this method, we advocate use of another method that we designed from scratch to address and replace the failed Elbow Method; we denote it the “Factionalization Method”, and it is proposed here.