There are a lot of clustering algorithms to choose from. The standard sklearn clustering suite has thirteen different clustering classes alone. So what clustering algorithms should you be using? As with every question in data science and machine learning it depends on your data.
A number of those thirteen classes in sklearn are specialised for certain tasks, such as co-clustering and bi-clustering, or clustering features instead of data points. Obviously an algorithm specializing in text clustering is going to be the right choice for clustering text data, and other algorithms specialize in other specific kinds of data.
Thus, if you know enough about your data, you can narrow down on the clustering algorithm that best suits that kind of data, or the sorts of important properties your data has, or the sorts of clustering you need done. So, what algorithm is good for exploratory data analysis (EDA)?
There are other nice-to-have features like soft clusters or overlapping clusters, but the above desiderata are enough to get started with because, oddly enough, very few clustering algorithms can satisfy them all!
Next we need some data. So, on to testing …. Before we try doing the clustering, there are some things to keep in mind as we look at the results. K-Means has a few problems, however. One of them is that you need to specify how many clusters you expect; if you know a lot about your data then that is something you might expect to know.
Finally, K-Means is also dependent upon initialization; give it multiple different random starts and you can get multiple different clusterings. This does not engender much confidence in any individual clustering that may result.
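The dependence on initialization can be checked directly. The following is a minimal sketch, not the article's own experiment: the half-moon data from `make_moons`, the choice of six clusters, and the five seeds are all assumptions made for illustration. Each fit uses a single random start, and the adjusted Rand index compares every run against the first (1.0 would mean identical clusterings).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Hypothetical stand-in data: two interleaved half-moons (not globular).
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

# n_init=1 and init="random" so each fit uses exactly one random start.
runs = [
    KMeans(n_clusters=6, init="random", n_init=1, random_state=seed).fit_predict(X)
    for seed in range(5)
]

# Compare every run with the first; scores below 1.0 mean the runs disagree.
scores = [adjusted_rand_score(runs[0], other) for other in runs[1:]]
print(scores)
```

If the scores sit well below 1.0, that is the instability described above; sklearn's `n_init` parameter exists to rerun K-Means from several starts and keep the best result by inertia.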
K-Means is going to throw points into clusters whether they belong or not; it also assumes your clusters are globular. K-Means scores very poorly on this point. If you have a good intuition for how many clusters the dataset you're exploring has then great, otherwise you might have a problem. Hopefully the clustering is stable for your data. Best to have many runs and check though. This is K-Means' big win. There are few algorithms that can compete with K-Means for performance.
If you have truly huge data then K-Means might be your only option. But enough opinion, how does K-Means perform on our test dataset?
We see some interesting results. First, the assumption of perfectly globular clusters means that the natural clusters have been spliced and clumped into various more globular shapes.
Worse, the noise points get lumped into clusters as well. Having noise pollute your clusters like this is particularly bad in an EDA world since it can easily mislead your intuition and understanding of the data. On a more positive note we completed clustering very quickly indeed, so at least we can be wrong quickly.
Affinity Propagation has some advantages over K-Means. Due to how the algorithm works under the hood with the graph representation, it allows for non-metric dissimilarities. Affinity Propagation also does, at least, have better stability over runs, but not over parameter ranges. The weak points of Affinity Propagation are similar to K-Means.
Since it partitions the data just like K-Means we expect to see the same sorts of problems, particularly with noisy data. Picking these parameters well can be difficult.
The implementation in sklearn defaults the preference to the median dissimilarity. This tends to result in a very large number of clusters. A better value is something smaller or negative, but it is data dependent. And how does it look in practice on our chosen dataset? The result is eerily similar to K-Means and has all the same problems. The noise points have been assigned to clusters regardless of being significant outliers.
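To see how much the preference setting matters, here is a small sketch comparing sklearn's default preference with a lower one. The blob data, the value `-50.0`, and the damping of `0.9` are illustrative assumptions, not values from the original experiment, and a good preference will differ for other data.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Hypothetical stand-in data: three compact blobs.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# Default: preference is taken from the median of the input similarities.
default_model = AffinityPropagation(damping=0.9, random_state=0).fit(X)

# Assumed lower preference; smaller values generally yield fewer exemplars.
low_pref_model = AffinityPropagation(
    preference=-50.0, damping=0.9, random_state=0
).fit(X)

for name, model in [("default", default_model), ("preference=-50", low_pref_model)]:
    print(name, "->", len(model.cluster_centers_indices_), "exemplars")
```

Note that every point still ends up with a label: as with K-Means, this is a partition of the data, so outliers are assigned to some exemplar.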
Worse still, it took us several seconds to arrive at this unenlightening conclusion. Mean Shift is centroid based, like K-Means and Affinity Propagation, but can return clusters instead of a partition.
The underlying idea of the Mean Shift algorithm is that there exists some probability density function from which the data is drawn, and the algorithm tries to place centroids of clusters at the maxima of that density function.
It approximates this via kernel density estimation techniques, and the key parameter is then the bandwidth of the kernel used. This is easier to guess than the number of clusters, but may require some staring at, say, the distributions of pairwise distances between data points to choose successfully. The other issue (at least with the sklearn implementation) is that it is fairly slow despite potentially having good scaling! I spent a while trying to find a good bandwidth value that resulted in a reasonable clustering.
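One way to get a starting point for the bandwidth, rather than guessing, is sklearn's `estimate_bandwidth` helper, which derives a value from nearest-neighbour distances. The sketch below is an illustration under assumptions: the blob data and `quantile=0.2` are made up for the example, and the estimate is only a first guess that may still need hand tuning.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Hypothetical stand-in data: three compact blobs.
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)

# Data-driven first guess at the kernel bandwidth (quantile is an assumption).
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=0)

# cluster_all=False leaves points outside every kernel unassigned (label -1),
# so the result is a set of clusters rather than a full partition.
model = MeanShift(bandwidth=bandwidth, cluster_all=False).fit(X)
labels = model.labels_

print("bandwidth:", round(bandwidth, 3))
print("clusters:", len(model.cluster_centers_))
print("unassigned points:", int((labels == -1).sum()))
```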
The choice below is about the best I found. Worse still, it took over 4 seconds to cluster this small dataset! Spectral clustering can best be thought of as a graph clustering. For spatial data one can think of inducing a graph based on the distances between points (potentially a k-NN graph, or even a dense graph). From there spectral clustering will look at the eigenvectors of the Laplacian of the graph to attempt to find a good low dimensional embedding of the graph into Euclidean space.
This is essentially a kind of manifold learning, finding a transformation of our original space so as to better represent manifold distances for some manifold that the data is assumed to lie on. Once we have the transformed space a standard clustering algorithm is run; with sklearn the default is K-Means.
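The graph, Laplacian, and embedding steps can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not sklearn's implementation: it uses a dense Gaussian affinity graph with an assumed `sigma`, the unnormalized Laplacian, a hand-made eight-point dataset, and, because there are only two groups, it splits on the sign of the second eigenvector instead of running K-Means on the embedding.

```python
import numpy as np

# Hypothetical toy data: two small squares of points, well separated in x.
X = np.array([
    [0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5],
    [3.0, 0.0], [3.5, 0.0], [3.0, 0.5], [3.5, 0.5],
])

# Dense affinity graph: Gaussian kernel on squared distances (sigma assumed).
sigma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / (2 * sigma ** 2))
np.fill_diagonal(W, 0.0)  # no self-loops

# Unnormalized graph Laplacian: degree matrix minus affinity matrix.
L = np.diag(W.sum(axis=1)) - W

# eigh returns eigenvalues in ascending order; the eigenvector for the
# second-smallest eigenvalue gives a one-dimensional embedding of the graph.
eigenvalues, eigenvectors = np.linalg.eigh(L)
embedding = eigenvectors[:, 1]

# Stand-in for the K-Means step: split the embedding on its sign.
labels = (embedding > 0).astype(int)
print(labels)
```

Which group receives label 0 depends on the arbitrary sign of the eigenvector; only the grouping itself is meaningful. The dense pairwise affinity matrix is also the expensive part, which is why a k-NN graph is often preferred for larger data.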
That means that the key for spectral clustering is the transformation of the space. We unfortunately retain some of K-Means' weaknesses. Worse, if we operate on the dense graph of the distance matrix, we have a very expensive initial step and sacrifice performance.
Spectral clustering performed better on the long thin clusters, but still ended up cutting some of them strangely and dumping parts of them in with other clusters. We also still have the issue of noise points polluting our clusters, so again our intuitions are going to be led astray. Performance was a distinct improvement over Affinity Propagation, however. Overall we are doing better, but are still a long way from achieving our desiderata.
Agglomerative clustering is really a suite of algorithms all based on the same idea: start with each point in its own cluster and merge pairs of clusters according to some criterion. Do this repeatedly until you have only one cluster and you get a hierarchy, or binary tree, of clusters branching down to the last layer, which has a leaf for each point in the dataset. More complex variations use things like mean distance between clusters, or distance between cluster centroids, etc.
Once you have a cluster hierarchy you can choose a level or cut according to some criteria and take the clusters at that level of the tree. You can also inspect the dendrogram of clusters and get more information about how clusters break down. On the other hand, if you want a flat set of clusters you need to choose a cut of the dendrogram, and that can be hard to determine.
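The merge-then-cut idea can be illustrated with a naive, pure-Python single linkage sketch. It is written for clarity rather than speed and is not how sklearn implements agglomerative clustering; the function name `single_linkage`, the six points, and the cut distance of 2.0 are all hypothetical choices for the example.

```python
from itertools import combinations
from math import dist


def single_linkage(points, cut_distance):
    """Merge the two closest clusters until the closest pair of clusters
    is farther apart than cut_distance; return clusters and merge log."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []  # (members_a, members_b, distance): the dendrogram, bottom up
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: cluster distance is the closest pair of members.
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d > cut_distance:
            break  # cutting the dendrogram at this height
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (10, 0)]
clusters, merges = single_linkage(points, cut_distance=2.0)
print(sorted(sorted(c) for c in clusters))  # [[0, 1, 2], [3, 4], [5]]
```

Raising `cut_distance` continues the merging further up the tree, which is the difficulty noted above: the flat clustering you get depends entirely on where you choose to cut.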
We are also still partitioning rather than clustering the data, so we still have that persistent issue of noise polluting our clusters. This is a more robust method than, say, single linkage, but it does tend toward more globular clusters. Similar to the spectral clustering, we have handled the long thin clusters much better than K-Means or Affinity Propagation. We in fact improved on spectral clustering a bit on that front.
We also still have the noise points polluting our clusters. Applying single linkage clustering to the transformed space results in a dendrogram, which we cut according to a distance parameter (called epsilon or eps in many implementations) to get clusters.
This provides several advantages. Better yet, since we can frame the algorithm in terms of local region queries, we can use various tricks such as kd-trees to get exceptionally good performance and scale to dataset sizes that are otherwise unapproachable with algorithms other than K-Means.
There are some catches however. Obviously epsilon can be hard to pick; you can do some data analysis and get a good guess, but the algorithm can be quite sensitive to the choice of the parameter.
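The sensitivity to epsilon is easy to demonstrate. The sketch below assumes the algorithm being described is the one sklearn exposes as `DBSCAN`, whose `eps` parameter plays the role of the cut distance; the tiny hand-made dataset, `min_samples=3`, and both `eps` values are illustrative assumptions. Points that belong to no cluster get the label `-1`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups and one distant outlier.
dense_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.2, 0.2]])
dense_b = dense_a + 5.0
outlier = np.array([[20.0, 20.0]])
X = np.vstack([dense_a, dense_b, outlier])

results = {}
for eps in (0.5, 10.0):
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X)
    n_clusters = len(set(labels.tolist()) - {-1})  # -1 marks noise
    n_noise = int((labels == -1).sum())
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
# eps=0.5: 2 clusters, 1 noise points
# eps=10.0: 1 clusters, 1 noise points
```

With the small value the two groups stay separate; with the large one they merge into a single cluster. In both cases the outlier is left as noise instead of being forced into a cluster.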
So how does it cluster our test dataset? I played with a few epsilon values until I got something reasonable, but there was little science to this; getting the parameters right can be hard. We also picked up a few tiny clusters in amongst the large sparse cluster. These problems are artifacts of not handling variable density clusters: to get the sparser clusters to cluster we end up lumping some of the denser clusters with them; in the meantime the very sparse cluster is still broken up into several clusters.
Their goal was to allow varying density clusters. Instead of taking an epsilon value as a cut level for the dendrogram, however, a different approach is taken. That tree can then be used to select the most stable or persistent clusters. This process allows the tree to be cut at varying height, picking out varying density clusters based on cluster stability.
The immediate advantage of this is that we can have varying density clusters; the second benefit is that we have eliminated the epsilon parameter, as we no longer need it to choose a cut of the dendrogram. This trades an unintuitive parameter for one that is not so hard to choose for EDA: what is the minimum size cluster I am willing to care about? Fortunately we can just import the hdbscan library and use it as if it were part of sklearn. If you are doing EDA you are trying to learn and gain intuitions about your data.
In that case it is far better to get no result at all than a result that is wrong. Bad results lead to false intuitions which in turn send you down completely the wrong path. Not only do you not understand your data, you misunderstand your data. All clustering algorithms have parameters; you need some knobs to turn to adjust things.
If you know little about your data it can be hard to determine what value or setting a parameter should have. This means parameters need to be intuitive enough that you can hopefully set them without having to know a lot about your data. If you run the algorithm twice with a different random initialization, you should expect to get roughly the same clusters back.