Introducing the “Factionalization Method” For Identifying the Inherent Number of Clusters in a Dataset
This Methodology Can Be Used With K-Means Clustering to Effectively Partition Datasets Into Discrete Subsets
Introduction
In data science, “unsupervised training” refers to a collection of techniques that study the inherent structure of datasets. As opposed to “supervised training”, in which each data point in a dataset is assigned some label (for classification problems) or numerical target (for regression problems), an objective of unsupervised training may be to partition a dataset into some discrete number of groupings in a manner that respects the structure of the dataset itself. If all the data points of a dataset fall into 𝑁 clearly identifiable groupings, then our objective is to partition that input dataset into 𝑁 different output datasets in a manner that respects those inherent groupings.
One very common methodology for partitioning a dataset into a discrete number of clusters is called “K-means clustering”. We do not cover the details of K-means clustering here and will assume the reader’s familiarity with it moving forward.
K-means clustering does not, however, determine the optimal number of clusters inherent to a dataset. Instead, the number of clusters is chosen arbitrarily up front, and the dataset is then partitioned into that number of categories in an optimized manner. A methodology called the “Elbow Method” is commonly used in data science circles to estimate the correct number of clusters in a dataset, but we find that method to be inaccurate, to lack mathematical rigor, and to resist automation. In fact, attempts to automate the Elbow Method led directly to our arguments against its use and to the formulation of the methodology explored in this essay. To review our refutation of the Elbow Method, please see
https://jasonnett.medium.com/the-elbow-method-is-wrong-9f5d43445821
In the following essay, we introduce a different methodology for estimating the number of clusters inherent to a dataset, which we call the “Factionalization Method”. Since K-means clustering requires the value of 𝐾 to be specified a priori, we endeavor to define a metric that measures the extent to which a dataset is partitioned into “factions” for a given value of 𝐾. If 𝐾ₜᵣᵤₑ is the correct number of clusters in a dataset, then the metric is defined so that it achieves a maximum at 𝐾 = 𝐾ₜᵣᵤₑ when charted over a range of values of 𝐾. Further, in the spirit of automating this method, we introduce a function that can be fitted to the metric, and the estimated optimal 𝐾 value for a dataset is defined as the location of the maximum of this function.
The form of the metric we propose is intuitively motivated, more mathematically rigorous than the Elbow Method, and can be automated (assuming a reliable machine learning fit, explored below). Most importantly, it produces a reasonably good estimate of the number of clusters inherent to a dataset. We do not claim, however, that the present form of the Factionalization metric is its final or optimal form. Our research into improving its design and performance is ongoing, and we have several such topics queued. Nevertheless, we believe it has reached a usable form, and so we introduce it to the data science community here, welcoming open discussion and critical review.
Before proceeding further, we would be remiss not to at least mention the “Silhouette Method”. This essay neither supports nor disputes any particular form of the Silhouette Method as written elsewhere. Rather, the method proposed here aims to be more computationally efficient and capable of automation in highly noisy, high-dimensional data scenarios. The Silhouette Method computes a score for each data point that depends on the distance from every data point to every other data point in the dataset, which becomes extremely computationally expensive as the size of the dataset grows. The metric proposed below is designed to measure how inherently clustered a dataset is for a given value of 𝐾 using only the distances between data points and the centroids, not the distances between each data point and every other data point. Further, the machine learning fit of a curve to the shape of the proposed metric keeps the methodology viable for realistic datasets whose noise translates into the metric itself. These effects will be clarified below.
Pseudo-data Used in this Study
The sklearn Python package provides a function, make_blobs, that generates inherently clustered datasets for development and testing purposes. The number of clusters, the number of data points per cluster, the number of dimensions of the space in which the data are generated, and the standard deviation of the distribution of points in each cluster (drawn from a normal distribution) are all configurable. The following script is an example of how sklearn (executed with python3.8) can be used to generate clustered datasets:
import os
import csv

from sklearn.datasets import make_blobs

MIN_K = 1
MAX_K = 20
SAMPLES_PER_CLUSTER = 100
N_DIMENSIONS = 2
CLUSTER_STD = 0.25
OUTPUT_DIR = 'data'
OUTPUT_FILENAME_BASE = 'pseudodata'

# Make sure the output directory exists before writing
os.makedirs(OUTPUT_DIR, exist_ok=True)

for k in range(MIN_K, MAX_K + 1):
    print('Generating k = ', k)

    # Generate k Gaussian blobs with SAMPLES_PER_CLUSTER points each
    data_points, labels = make_blobs(
        n_samples=k * SAMPLES_PER_CLUSTER,
        n_features=N_DIMENSIONS,
        centers=k,
        cluster_std=CLUSTER_STD,
        shuffle=True,
        random_state=0,
    )

    # Write the pseudodata to file
    output_filename = os.path.join(
        OUTPUT_DIR,
        OUTPUT_FILENAME_BASE +
        '_clusters_' + str(k) +
        '_std_' + str(CLUSTER_STD) +
        '_samplesPerCluster_' + str(SAMPLES_PER_CLUSTER) +
        '.csv'
    )
    with open(output_filename, mode='w', newline='') as output_file:
        csv_writer = csv.writer(output_file, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        for point in data_points:
            csv_writer.writerow(point)
With this script we generate pseudo-datasets with three clusters, six clusters, and nine clusters. While these were not the only datasets we used to design and test the methodology that follows, we will demonstrate our results in terms of these three datasets. We also note that the following three pseudo-datasets are the same datasets we used in our refutation of the Elbow Method; this was intentional, for comparison purposes.
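Each generated CSV can then be loaded back for analysis. The following is a minimal sketch (using numpy, which the script above does not itself require, and the three-cluster file that the script writes); the sketches in later sections reuse this X array.

import numpy as np

# Load the three-cluster pseudo-dataset written by the generation script above
X = np.loadtxt(
    'data/pseudodata_clusters_3_std_0.25_samplesPerCluster_100.csv',
    delimiter=',',
)
print(X.shape)  # (300, 2): 3 clusters of 100 samples each, in 2 dimensions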
Three-cluster pseudo-dataset

Six-cluster pseudo-dataset

Nine-cluster pseudo-dataset

Each of these plotted pseudo-datasets illustrates its inherently clustered nature, and the red points are the centroids produced by an execution of K-means clustering on each pseudo-dataset when the correct number of centroids 𝐾 is chosen. In practice, we do not know a priori the optimal number of clusters to assume for a dataset, but the K-means clustering algorithm requires that number to be specified up front.
We intentionally use pseudo-datasets that are clearly and tightly clustered for pedagogical and illustrative purposes. The correct answer to the question of how many clusters are inherent to each dataset is unambiguous, so a reliable methodology for determining that answer should be able to reproduce the correct answer built into the datasets.
In practice, of course, the datasets that this methodology would be used on may be much higher dimensional and contain far more noise. There may not even be inherent clusters. Nevertheless, we consider performance on these pseudo-datasets to provide sanity check demonstrations that our methodology must be able to perform correctly on before moving to more realistic data scenarios. We will explore the performance on more realistic datasets in subsequent essays.
Intuitive Motivation
Our objective is to define a measurement of a dataset that is maximized when the number of centroids chosen during K-means clustering 𝐾 is equal to the true number of clusters inherent to the dataset 𝐾ₜᵣᵤₑ. Observe two characteristics of a dataset when 𝐾 = 𝐾ₜᵣᵤₑ:
- The mean distance from each data point to the centroid it is assigned to is relatively small (we have tight clusters)
- The mean distance from each data point to the centroids it is not assigned to is relatively large (we have well-separated clusters)

If too few centroids are chosen (𝐾 < 𝐾ₜᵣᵤₑ), then at least some centroids tend to inhabit the space between clusters after the K-means clustering has completed. This will sharply increase the mean distance from data points to their assigned centroids because some of the centroids no longer reside within any cluster.

If too many centroids are chosen (𝐾 > 𝐾ₜᵣᵤₑ), then at least some centroids tend to share the points of clusters after the K-means clustering has completed. This means the mean distance from each data point to its assigned centroid will decrease, but only slightly because some of those data points are still assigned to a centroid that resides within the same cluster (just a little closer than the ideal scenario). Meanwhile, the mean distance from each data point to unassigned centroids now decreases sharply because some of those distances are to centroids that now reside within the same cluster of points.

Our objective is to use these two aspects of clustered data to construct a metric that is maximized for 𝐾 = 𝐾ₜᵣᵤₑ.
Precursors to a Quantitative Metric For Predicting The Optimal Number of Clusters
Following the intuitive argument from the previous section, let’s now start computing and charting quantities that exhibit the behaviors described for the three pseudo-datasets.
First, consider the mean distance between each data point and the centroid to which it is assigned for a range of 𝐾 values (i.e. the number of centroids chosen during K-means clustering). Those values are then renormalized to the magnitude of the highest value, for reasons addressed shortly. In each of the following cases, we see a sharp rise in that mean distance value for 𝐾 < 𝐾ₜᵣᵤₑ, moving to the left in the charts.
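Before looking at the charts, here is a minimal sketch of how this quantity can be computed, using scikit-learn's KMeans and the pseudo-dataset X loaded earlier. The helper name and the range of 𝐾 values are our own choices for illustration, not part of the method itself.

import numpy as np
from sklearn.cluster import KMeans

def mean_assigned_distance(X, k, random_state=0):
    # Mean Euclidean distance from each data point to the centroid it is assigned to
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    assigned_centroids = km.cluster_centers_[km.labels_]
    return np.linalg.norm(X - assigned_centroids, axis=1).mean()

k_values = list(range(2, 31))
mean_assigned = np.array([mean_assigned_distance(X, k) for k in k_values])
mean_assigned_norm = mean_assigned / mean_assigned.max()  # renormalize to the largest value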
Three-cluster pseudo-dataset

Six-cluster pseudo-dataset

Nine-cluster pseudo-dataset

Note that the behavior of the points in the mean-distance-to-assigned-centroid charts asymptotically approaches both the vertical axis and the horizontal axis. As such, we will model the behavior of these points with a generic hyperbola, Eqn. (1).

Second, recall that the mean distance from each data point to the unassigned centroids decreases sharply moving to the right for 𝐾 > 𝐾ₜᵣᵤₑ. As such, let's compute the inverse mean distance from each data point to the centroids it is not assigned to. This value is also renormalized to the magnitude of the highest value (we do this so that this value and the mean distance to assigned centroids can be compared with equal weighting). Observe the sharp rise for 𝐾 > 𝐾ₜᵣᵤₑ:
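The corresponding sketch for this quantity, reusing X, k_values, and the imports from the sketch above, might look as follows.

def mean_inverse_unassigned_distance(X, k, random_state=0):
    # Mean of 1/distance from each data point to every centroid it is NOT assigned to
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    # Distances from every point to every centroid: shape (n_points, k)
    dists = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
    # Mask out each point's own (assigned) centroid, then average the inverse distances
    not_assigned = np.ones_like(dists, dtype=bool)
    not_assigned[np.arange(len(X)), km.labels_] = False
    return (1.0 / dists[not_assigned]).mean()

mean_unassigned_inv = np.array([mean_inverse_unassigned_distance(X, k) for k in k_values])
mean_unassigned_inv_norm = mean_unassigned_inv / mean_unassigned_inv.max()  # renormalize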
Three-cluster pseudo-dataset

Six-cluster pseudo-dataset

Nine-cluster pseudo-dataset

Now we need a model for the behavior of these points in the mean inverse distance to unassigned centroids charts. Let's start conceptually with a dataset of some number of clusters and 𝐾 = 𝐾ₜᵣᵤₑ. As we re-run the clustering with fewer centroids (lower 𝐾) from there, we are still measuring the distance from data points to centroids that do not reside in the same cluster as the data point, and there are fewer centroids spread out through the space of the dataset. So, this distance increases marginally until 𝐾 = 2, when there are just two centroids. At that point, the value of this quantity is simply the mean distance from the data points assigned to one centroid to the other centroid, and vice versa. So the inverse is a value that mildly decreases to some finite value as 𝐾 decreases.
Meanwhile, as we increase the number of centroids from the true number of clusters, multiple centroids start getting assigned to single clusters of data. Now, some of the distances from data points to unassigned centroids are much shorter because the unassigned centroid resides in the same cluster. So, the mean distance from data points to unassigned centroids starts to sharply decrease as the number of centroids 𝐾 increases past the true number of clusters in the dataset, meaning the inverse of that value will sharply increase. As 𝐾 goes to infinity (or, more practically, every single data point also has its own centroid), this value converges to the mean distance from every data point to every other data point. This sounds like the right side of a logistic sigmoid, which is what we’ll use to model the mean inverse distance from data points to the centroids they are not assigned to.

Introducing the “Factionalization” Metric
We can combine these two quantities to define a metric that achieves a detectable extremum at 𝐾 = 𝐾ₜᵣᵤₑ (or at least a reasonable estimate). While there are a variety of ways to combine these two quantities — and the best way to do so is certainly debatable — we introduce just one for the purposes of this essay.
Let 𝑋 be a set of data points and 𝐾 be the chosen number of centroids during a particular execution of the K-means clustering algorithm on dataset 𝑋. Upon completion of the algorithm, every data point 𝑥∈𝑋 is assigned to some set 𝐺ᵢ of data points associated with a centroid 𝑔ᵢ, where the set {𝑔ᵢ} ∀𝑖∈[1,…,𝐾] is the set of 𝐾 centroids. Thus, the set of data points 𝑋 has been partitioned into 𝐾 mutually disjoint sets 𝐺₁,…,𝐺_𝐾 as a result of the K-means clustering algorithm. Now let's define the mean distance between every data point and the centroid to which it is assigned, denoted here as 𝛼. Assume 𝑑(𝑎, 𝑏) is the Euclidean distance between two arbitrary coordinates 𝑎 and 𝑏, and let |𝑋| be the cardinality of set 𝑋.

\alpha = \frac{1}{|X|} \sum_{i=1}^{K} \sum_{x_j \in G_i} d(x_j, g_i) \qquad (3)
In this, we’re summing over the 𝐾 centroids, then for each centroid summing through each data point assigned to it, then computing the distance between the data point and the centroid to which it is assigned. Finally, we divide by the cardinality of the dataset to get that mean distance value.
Let's also define the mean inverse distance between every data point 𝑥ⱼ∈𝐺ᵢ ∀𝑖∈[1,…,𝐾] and every centroid to which 𝑥ⱼ is not assigned, {𝑔ₖ}\𝑔ᵢ (all centroids except centroid 𝑔ᵢ), denoted 𝛽. Note that in this case the number of distance values being computed and averaged is the number of points in dataset 𝑋 times one less than the number of centroids, i.e. |𝑋|(𝐾−1).

\beta = \frac{1}{|X|\,(K-1)} \sum_{i=1}^{K} \sum_{x_j \in G_i} \sum_{\substack{k=1 \\ k \neq i}}^{K} \frac{1}{d(x_j, g_k)} \qquad (4)
In this, we're summing over the 𝐾 centroids, then for each centroid summing through each data point assigned to it, then for each data point computing the inverse distance to every centroid to which it is not assigned.
Recall also from the previous section that we renormalize both 𝛼 and 𝛽 to the magnitude of the highest value computed for each, respectively. In effect, the highest value of each, when computed across a range of values of 𝐾, is defined to be 1.0, with the rest renormalized accordingly. So, before proceeding, we define the renormalized quantities

\hat{\alpha}(K) = \frac{\alpha(K)}{\max_{K'} \alpha(K')} \qquad (5)

\hat{\beta}(K) = \frac{\beta(K)}{\max_{K'} \beta(K')} \qquad (6)
The “Factionalization” quantity we are defining here is then the inverse of the sum of the renormalized 𝛼 and 𝛽:

F(K) = \frac{1}{\hat{\alpha}(K) + \hat{\beta}(K)} \qquad (7)
This metric is designed to achieve its maximum value at 𝐾=𝐾ₜᵣᵤₑ when charted over a range of values of 𝐾.
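Under the assumptions of the sketches above, the Factionalization metric of Eqn. (7) is a one-line combination of the two renormalized arrays already computed; our mean_assigned_norm and mean_unassigned_inv_norm stand in for the renormalized 𝛼 and 𝛽.

# Factionalization metric of Eqn. (7) over the same range of K values
F = 1.0 / (mean_assigned_norm + mean_unassigned_inv_norm)

# A naive estimate of the number of clusters: the K at which the computed metric peaks
k_naive = k_values[int(np.argmax(F))]
print('Naive estimate of K_true:', k_naive)

As discussed in the next section, taking the raw argmax in this way can be misled by noise in the computed points, which is what motivates fitting a smooth curve to 𝐹(𝐾).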
Fitting a Curve to the Factionalization Metric Points
We have a metric defined in Eqn. (7) to estimate the correct number of clusters inherent to a dataset. However, naively choosing the 𝐾 value at which 𝐹 is maximized can be problematic if the computed factionalization values are noisy. In the case of the highly idealized pseudo-datasets we’re working with, this will not be apparent. However, when this technique is applied to more realistic datasets, we will see the necessity for a more robust method for estimating the 𝐾 value at which 𝐹 is maximized.
Recall from Eqn. (1) above that we are approximating 𝛼 by a generic hyperbola and from Eqn. (2) that we are approximating 𝛽 by a logistic sigmoid. We now sum and invert those two functional forms to obtain a formula for a curve that can be fitted to 𝐹(𝐾). Estimating the value of 𝐾ₜᵣᵤₑ is then a matter of identifying the first maximum of that curve. Let 𝑓(𝑥) be the curve that is fitted to 𝐹(𝐾)

There are six parameters that float during fitting and they exhibit the following behaviors (under the inverse):
- 𝑎: Stretch the logistic sigmoid vertically
- 𝑏: Stretch the logistic sigmoid horizontally
- 𝑐: Shift the logistic sigmoid horizontally
- 𝑑: Stretch the hyperbola vertically
- 𝑔: Scale the strength of the coefficient of the hyperbola
- ℎ: Shift the sum of the logistic and hyperbola curve vertically
The entire sum is then inverted by an overall coefficient of −1.
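The following sketch fits such a curve with scipy.optimize.curve_fit and locates its peak, reusing F and k_values from the earlier sketches. Because we are writing the functional form ourselves, the hyperbola term d / x**g is only one plausible parameterization consistent with the parameter descriptions above, and the starting guesses p0 are arbitrary; Eqn. (8) remains the authoritative form.

from scipy.optimize import curve_fit

def f(x, a, b, c, d, g, h):
    # Inverted sum of a logistic sigmoid, a hyperbola-like term, and a vertical offset.
    # The d / x**g term is an assumed stand-in for the "generic hyperbola" of Eqn. (1).
    return -(a / (1.0 + np.exp(-b * (x - c))) + d / np.power(x, g) + h)

x_data = np.array(k_values, dtype=float)
params, _ = curve_fit(f, x_data, F, p0=[1.0, 1.0, 5.0, 1.0, 1.0, 0.0], maxfev=20000)

# Evaluate the fitted curve on a fine grid and take the K at its peak as the estimate.
# A more careful implementation would isolate the first local maximum specifically.
x_fine = np.linspace(x_data.min(), x_data.max(), 2000)
k_estimate = x_fine[int(np.argmax(f(x_fine, *params)))]
print('Estimated number of clusters:', int(round(float(k_estimate))))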
In the following three charts, the Factionalization metric is computed and charted for 2 ≤ 𝐾 ≤ 30. The curve of Eqn. (8) is then fitted to those points. Finally, a vertical green line indicates the value of 𝐾ₜᵣᵤₑ and a vertical blue line indicates the automatically detected first maximum of the fitted curve 𝑓(𝑥) (the green line is not visible because it lies underneath the blue vertical line in these examples, indicating a correct prediction).
Three-cluster pseudo-dataset

Six-cluster pseudo-dataset

Nine-cluster pseudo-dataset

We see in the cases of 𝐾ₜᵣᵤₑ = 3 and 𝐾ₜᵣᵤₑ = 6 that a very sharp maximum occurs at the correct values for these highly contrived datasets containing a known number of clusters and no noise. Admittedly, the maximum in those two cases is so sharp that the fit of 𝑓(𝑥) had some difficulty fully contorting itself to the points in the chart. The 𝐾ₜᵣᵤₑ = 9 case is an example of why we cannot blindly choose the 𝐾 value at which 𝐹(𝐾) has its maximum value: due to some noise in the data of the chart (K-means clustering does not produce deterministic results from one execution to the next given consistent starting parameters, especially as 𝐾 increases), there are a few points other than 𝐾 = 𝐾ₜᵣᵤₑ = 9 for which higher values of 𝐹(𝐾) are computed, but the fitted curve 𝑓(𝑥) still finds the correct answer. This fit becomes even more important when the methodology introduced here is used on more realistic datasets, in which the shape of 𝐹(𝐾) is often flatter and noisier.
Thus, the 𝑥 = 𝐾 value at which 𝑓(𝑥) achieves its first (typically only) maximum reliably estimates 𝐾ₜᵣᵤₑ for at least 2 ≤ 𝐾ₜᵣᵤₑ ≤ 9. Anecdotally, we find the estimate to be decent up to 𝐾ₜᵣᵤₑ = 20 as well, though a slight bias toward underestimating 𝐾ₜᵣᵤₑ manifests in that range. As such, we are continuing to research improvements to the methodology discussed here and anticipate publishing them in the future.
Conclusion
We have introduced a novel methodology for estimating the number of clusters inherent to a dataset for use in unsupervised training. The motivation for designing this methodology was to have an accurate and fully automated process for computing this estimate. The Factionalization metric introduced here was designed from first principles, with an intuitive foundation, to yield a correct answer. We note, however, that the methodology is in a first usable form, not a final one. Adjustments and variations are likely to follow as our research and development continues. We welcome all legitimate questions and critiques.