Note: This post was originally intended as a segment in the Natural Language Generation series documenting my current project. However, as I was writing, I realized that framing it as part of that series required sharing information that was too close to proprietary for comfort, so I’ve broken it off from the rest.
Automated selection of in -means clustering remains an outstanding problem in machine learning. There are a number of algorithms that allege to do a decent job under certain problem constraints (see this and this), but the only truly general and widely accepted solution to the problem of selecting is the elbow method. But the elbow method still requires and expert human to look at a line plot, a scatter plot, and investigate whether the clusters are real-world intuitive. And that’s suboptimal; this is something we would often prefer to have a computer do for us.
While the problem of solving for is difficult in general, the same isn’t necessary true for a very narrowly defined problem space. The difficulty of the general problem is largely responsible for it’s lack of solution or widespread coverage in the literature, but we don’t face general problems very often in practice. The vast majority of the time, we have a particular business domain, a particular goal, and a particular data source. By narrowing the scope of the problem from the general to the specific, it becomes a lot easier to solve by identifying some heuristics that mimic the way a human analyst or scientist would think about -selection in the wild.
What we want is some set of computations involving parameters that can approximate expert thinking for a specific problem, where is an abstraction away from the way we should reason about in general and is an abstraction away from the particular considerations that an expert would make in choosing for a specific problem.
What is C?
This is the general part of the problem. So, let’s just look at what’s happening in the elbow method (and associated checks) real quick.
The Elbow Method
Under the elbow method, the goal is to judgmentally select a number of clusters such that: (1) the decrease in total within-group distance in moving from to is large for all , but (2) the decrease in total within-group distance in moving from to is not large for any . In practice, the generality of (1) to all and (2) to all may be impossible if the vs. distance plot has kinks in it; in this case, we can relax the generality requirements. In a more formal setting, I’d give a complete discussion of these cases, but for now and in this setting, we’re just going to handwave over them and let people figure out how to formalize those situations on their own.
Normally, we select visually, by looking for the “elbow” of the plot as shown above by the red circle. This is what we want to formalize. Notice that we aren’t so much concerned with the size of the distance at as we are with the left-and-right sided derivatives at , or the size of the change in when moving to (from) ().
The actual derivatives might be positive or negative depending on how the distance metric is expressed. I usually express the distance directly, which yields negative derivatives; the above plot expresses distance in terms of explained variance, yielding positive derivative. In either case, we just want the right-handed derivative to be much closer to zero than the left-handed size; normally, and with as the total distance, and , and looking at a plot gives that to us visually without really giving thought to derivatives at all.
Calculating derivatives can be annoying in practice, especially when you don’t know the form of total distance as a function of . But we have two highly interpretable measures of the change in distance. The first is the difference , where is distance; the second is the ratio . Letting be either the difference or ratio, depending on which you choose, the formal property we want is:
And now the problem is a matter of selecting the right , which belongs to another part of the problem. This is really a heuristic approximation to setting an upper bound on the right second derivative; in some problems, it might also be useful to set a lower bound on the left second derivative. To keep things simple, we don’t do that here.
The Scatter Plot
For me, the main point of looking at a scatter plot of my clusters is to ensure that they capture groups that I would naturally call clusters. We can divide any large set up into subsets however we like, but that’s subsetting or segmenting, not clustering. Kmeans itself does a lot of work for us in making sure that the items inside each cluster nearer to each other, on average, than they are to items in another cluster, but sometimes it doesn’t take us all the way. Consider this scatter plot, which shows the result of -means clustering on the Iris dataset:
It’s pretty clear that we want to regard the red cluster as its own thing. But it’s really not so clear that we want to separate blue and green. The silhouette score is the measure we want here. Fore each datum , the silhouette score is defined as
where is the mean within-cluster distance to . To understand , let be the mean distance between and items in cluster . Then is the minimum across all , and represents the second-best cluster for .
The silhouette ranges from to , where values close to mean that is properly clustered, values near mean that it’s difficult to really say whether is properly clustered, and values near mean that is poorly clustered.
If we let be the mean silhouette across all observations, then the constraint we want to use is:
And once again, the problem is reduced to selection of a appropriate to your specific environment. Scikit-learn and R’s cluster package both include functions that will calculate the silhouette score for you.
Really, you can add any constraints you like. In my case, we needed the cluster means to be significantly far apart for the cluster to be salient to the user, so I included a constraint:
where and are the mean points clusters and . You might also want to limit the range of viable or constrain the between variance in some way. Whatever you need to tailor things to your specific case.
Finding the theta vector
I’m afraid this part is mostly judgmental and still requires a human expert. But, the problem of choosing suitable only has to be done once for each clustering problem, not every time we want to refresh the clusters. I don’t see a simple and obvious way to tune them algorithmically. Sure, we could have human experts label the data and minimize some error function, but if we are able to have experts label the data for us, then we’ve just turned this into a supervised learning problem, and we won’t be using -means.
The real takeaway here is that even in cases when a problem needs a bit of attention from a human expert, we can still achieve a high level of automation by formalizing individual criteria used by humans in making these decisions and automating heuristics that mimic their reasoning.