# Automatic selection of k in k-means clustering

Note: This post was originally intended as a segment in the Natural Language Generation series documenting my current project. However, as I was writing, I realized that framing it as part of that series required sharing information that was too close to proprietary for comfort, so I’ve broken it off from the rest.

Automated selection of $k$ in $k$-means clustering remains an outstanding problem in machine learning. There are a number of algorithms that allege to do a decent job under certain problem constraints (see this and this), but the only truly general and widely accepted solution to the problem of selecting $k$ is the elbow method. But the elbow method still requires and expert human to look at a line plot, a scatter plot, and investigate whether the clusters are real-world intuitive. And that’s suboptimal; this is something we would often prefer to have a computer do for us.

While the problem of solving for $k$ is difficult in general, the same isn’t necessary true for a very narrowly defined problem space. The difficulty of the general problem is largely responsible for it’s lack of solution or widespread coverage in the literature, but we don’t face general problems very often in practice. The vast majority of the time, we have a particular business domain, a particular goal, and a particular data source. By narrowing the scope of the problem from the general to the specific, it becomes a lot easier to solve by identifying some heuristics that mimic the way a human analyst or scientist would think about $k$-selection in the wild.

What we want is some set of computations $C$ involving parameters $\Theta$ that can approximate expert thinking for a specific problem, where $C$ is an abstraction away from the way we should reason about $k$ in general and $\Theta$ is an abstraction away from the particular considerations that an expert would make in choosing $k$ for a specific problem.

### What is C?

This is the general part of the problem. So, let’s just look at what’s happening in the elbow method (and associated checks) real quick.

#### The Elbow Method

Under the elbow method, the goal is to judgmentally select a number of clusters $k$ such that: (1) the decrease in total within-group distance in moving from $k-1$ to $k$ is large for all $n < k$, but (2) the decrease in total within-group distance in moving from $k$ to $k+1$ is not large for any $n >= k$. In practice, the generality of (1) to all $n < k$ and (2) to all $n >= k$ may be impossible if the $k$ vs. distance plot has kinks in it; in this case, we can relax the generality requirements. In a more formal setting, I’d give a complete discussion of these cases, but for now and in this setting, we’re just going to handwave over them and let people figure out how to formalize those situations on their own.

Normally, we select $k$ visually, by looking for the “elbow” of the plot as shown above by the red circle. This is what we want to formalize. Notice that we aren’t so much concerned with the size of the distance at $k$ as we are with the left-and-right sided derivatives at $k$, or the size of the change in $k$ when moving to (from) $k+1$ ($k-1$).

The actual derivatives might be positive or negative depending on how the distance metric is expressed. I usually express the distance directly, which yields negative derivatives; the above plot expresses distance in terms of explained variance, yielding positive derivative. In either case, we just want the right-handed derivative to be much closer to zero than the left-handed size; normally, and with $D$ as the total distance, $\left|\frac{\partial D}{\partial k_-}\right| > 1$ and $\left|\frac{\partial D}{\partial k_+}\right| < 1$, and looking at a plot gives that to us visually without really giving thought to derivatives at all.

Calculating derivatives can be annoying in practice, especially when you don’t know the form of total distance as a function of $k$. But we have two highly interpretable measures of the change in distance. The first is the difference $D(k) - D(k - 1)$, where $D$ is distance; the second is the ratio $D(k)/D(k-1)$. Letting $\eta_k$ be either the difference or ratio, depending on which you choose, the formal property we want is:

$\frac{\eta_{k}}{\eta_{k - 1}} < \theta_D$

And now the problem is a matter of selecting the right $\theta_D$, which belongs to another part of the problem. This is really a heuristic approximation to setting an upper bound on the right second derivative; in some problems, it might also be useful to set a lower bound on the left second derivative. To keep things simple, we don’t do that here.

### The Scatter Plot

For me, the main point of looking at a scatter plot of my clusters is to ensure that they capture groups that I would naturally call clusters. We can divide any large set up into subsets however we like, but that’s subsetting or segmenting, not clustering. Kmeans itself does a lot of work for us in making sure that the items inside each cluster nearer to each other, on average, than they are to items in another cluster, but sometimes it doesn’t take us all the way. Consider this scatter plot, which shows the result of $3$-means clustering on the Iris dataset:

It’s pretty clear that we want to regard the red cluster as its own thing. But it’s really not so clear that we want to separate blue and green. The silhouette score is the measure we want here. Fore each datum $i$, the silhouette score is defined as

$\frac{b(i) - a(i)}{\max\{a(i),b(i)\}},$

where $a(i)$ is the mean within-cluster distance to $i$. To understand $b(i)$, let $d_j(i)$ be the mean distance between $i$ and items in cluster $j$. Then $b(i)$ is the minimum $d_j(i)$ across all $j$, and $j$ represents the second-best cluster for $i$.

The silhouette ranges from $-1$ to $1$, where values close to $1$ mean that $i$ is properly clustered, values near $0$ mean that it’s difficult to really say whether $i$ is properly clustered, and values near $-1$ mean that $i$ is poorly clustered.

If we let $s_\mu$ be the mean silhouette across all observations, then the constraint we want to use is:

$s_\mu\geq\theta_{silh}$

And once again, the problem is reduced to selection of a $\theta_{silh}$ appropriate to your specific environment. Scikit-learn and R’s cluster package both include functions that will calculate the silhouette score for you.

### Other constraints

Really, you can add any constraints you like. In my case, we needed the cluster means to be significantly far apart for the cluster to be salient to the user, so I included a constraint:

$\left| i_\mu - j_\mu \right| \geq \theta_{\mu},$

where $i_\mu$ and $j_\mu$ are the mean points clusters $i$ and $j$. You might also want to limit the range of viable $k$ or constrain the between variance in some way. Whatever you need to tailor things to your specific case.

### Finding the theta vector

I’m afraid this part is mostly judgmental and still requires a human expert. But, the problem of choosing suitable $\theta$ only has to be done once for each clustering problem, not every time we want to refresh the clusters. I don’t see a simple and obvious way to tune them algorithmically. Sure, we could have human experts label the data and minimize some error function, but if we are able to have experts label the data for us, then we’ve just turned this into a supervised learning problem, and we won’t be using $k$-means.

The real takeaway here is that even in cases when a problem needs a bit of attention from a human expert, we can still achieve a high level of automation by formalizing individual criteria used by humans in making these decisions and automating heuristics that mimic their reasoning.