Tuesday, October 13, 2020

What is the right number of clusters?

This is a question that comes up *all* the time when people first encounter clustering. Sadly, the information that they get about how to choose the right number of clusters is exactly wrong in a lot of the situations. The single right answer is that you should choose the parameterization of your clustering that optimizes whatever downstream use you have for clustering.

Far too often, that downstream use is simply ignored and advice like finding the knee in the residual is given as if it applied universally.
There are some really cool algorithms that apparently let the number of clusters come from the data. For instance, check this tutorial on Dirichlet Processes and infinite mixture modelingOr take a look at this description of DP-means which applies these ideas of Bayesian modeling to the problem of clustering.

These are very exciting and you can use these methods for some pretty impressive practical results. For instance, this really fast version of $k$-means uses something akin to DP-means as an internal step to build good sketches of an empirical distribution that can give you a high quality clustering (probably).

But there is a really dark side here. The fact is that you are just moving the arbitrary decision about the number of clusters into an arbitrary decision about which prior distribution to pick.

The fact is, there is no universal best number of clusters for $k$-means clustering. You have to make assumptions and those assumptions may run very, very counter to reality. Whether those assumptions concern a heuristic choice of $k$ for $k$-means, or a heuristic choice of a scaling parameter $\lambda$ for DP-means, or a base distribution $G_0$ for a Dirichlet process, or even just deciding which data to ignore, you are still making a strong decision about what is best and that decision does not come from the data. Each of these decisions, together with choices like how you define distance are just different ways of encoding our prejudices.

And it is completely necessary to encode our prejudices.

The real-world term for these assumptions is called “domain knowledge”. You need to know why you are clustering data to understand how to cluster it.

As a classic example, the data in the figure below are all the exact same synthetic dataset viewed with different definitions of distance. Should there be one, two, three or four clusters? Or should there be 200? Each of these is a correct answer for different applications.

(see how different definition of distance changes our view of clustering) for the R script that generated these figures)

So let's step through some practical thoughts on clustering:

- first, think about what you want to do. Is the clustering going to make pretty pictures? Or are you using the clustering as an input to a machine learning system?

- second, think about distance and scaling. It is very unlikely that the data as you get it will be what you want. Some variables might be nearly insignificant and should be removed or scaled down. Others might be important, but measured on a very different scale from the rest of the variables you observe. There might well be other ways to improve the sense of proximity than Euclidean distance on your original data. That takes us to the next point.

- consider dimensionality reduction before clustering. Algorithms like UMAP can vastly simply the structure of your data (if you have a decent metric). The sophistication of your projection can have the result of allowing a simpler clustering algorithm.

- try something like $k$-means or $DP$-means to see how it goes. To evaluate this, test your data against its ultimate use. If that is making pictures, look at the pictures. If it is intended to generate features for a downstream model check how much lift your clustering gives you. 

As an example, suppose that our data is heavily intertwined like these two spirals.

The conventional wisdom is that this kind of data is not suitable for simple clustering methods like $k$-means. But here, I have used 40 clusters (see the x marks) and you can see that the data nearest to each cluster is from a single class. That means that the distance to each of the clusters is an interesting restatement of the original data that makes things much simpler. 

Note that this is the opposite of dimensionality reduction. The original data had 2 dimensions, but the restated data as distances to clusters now has 40 dimensions. Classification in this restated space is vastly simpler than in the original space because we just have to find which of the 40 values is smallest to find which color the point should be. And, indeed, that is what we see. 

Here is an ROC curve for two linear models. The brown line shows a linear classifier that is based on the original data. You can see how it barely exceeds the performance of a random classifier (the black diagonal). On the other hand, the classifier that uses the inverse distance to each of the 40 clusters is very good (you can tell because the green line goes very close to the upper left corner).

In fact, this model is so simple that you could hand code it by finding the nearest cluster and using the dominant color for that cluster.

This is wonderful and cool. It should be remembered, however, that you could get similarly good results by using a fairly simple neural network with enough hidden nodes. This is because the hidden nodes in a neural network function very similarly to the way that this cluster based classifier is working. This happens because the multiplication by the weights for the first layer is a dot product and this dot product is closely related to Euclidean distance.

The point here is that 40 clusters is far more than would have been recommended by most of the simple criteria that many people recommend (such as finding the knee in the total distance). That difference is because the intended use is very different.

So to mis-quote Forrest Gump, clustering is as clustering does!