One basic metric for measuring how "good" your clustering is in comparison to your known class labels is called purity. This is a form of external (supervised) evaluation: you have ground-truth labels for the instances, based on real-world data, and you judge the clustering against them.
The mathematical definition of purity is as follows:
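In symbols, this is usually written as

purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|

where Ω = {ω_1, …, ω_K} is the set of clusters, C = {c_1, …, c_J} is the set of classes, and N is the total number of instances.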
In words, quoting a professor at Stanford University:
To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N.
A simple example: suppose a very naive clustering produced by k-means with k=2 looked like this:
Cluster1
Instance  Label
1         A
5         B
7         B
3         B
2         B

Cluster2
Instance  Label
4         A
6         A
8         A
9         B
In Cluster1 there are 4 instances of label B and 1 instance of label A, and Cluster2 has 3 instances of label A and 1 instance of label B. The purity of a single cluster is the count of its most frequent label divided by the total number of instances in that cluster; the overall purity then combines the clusters (here k=2) by counting the correctly assigned instances across all clusters and dividing by N, as in the definition above.
Therefore the purity of Cluster1 is:
4/5 = 0.80
The 4 comes from the fact that the most frequent label (B) occurs 4 times, and there are 5 instances in the cluster in total.
It follows that the purity of Cluster2 is:
3/4 = 0.75
The overall purity is the total number of correctly assigned instances divided by the total number of instances: (4 + 3) / 9 ≈ 0.78. So what does this tell us? A cluster is considered "pure" if it has a purity of 1, since that means every instance in it carries the same label. The best possible overall purity is 1, which you get when every cluster is individually pure; a score close to 1 indicates that the clusters line up well with the known labels, i.e. k-means did a pretty good job of recovering them.
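To make the computation concrete, here is a minimal sketch in plain Python (no external libraries) that reproduces the numbers above; the cluster ids and label lists are just the toy example from the tables.

```python
from collections import Counter

def purity(cluster_ids, labels):
    """Fraction of instances whose label matches the majority label
    of the cluster they were assigned to."""
    members = {}
    for c, y in zip(cluster_ids, labels):
        members.setdefault(c, []).append(y)
    # for each cluster, count the instances carrying its most frequent label
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in members.values())
    return correct / len(labels)

# the toy example from the tables above
cluster_ids = [1, 1, 1, 1, 1, 2, 2, 2, 2]
labels      = ['A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'B']
print(purity(cluster_ids, labels))   # (4 + 3) / 9 ≈ 0.78
```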
However, be aware that purity is not always the best or most telling metric. For example, if you had 10 points and chose k=10, every cluster would trivially have a purity of 1, giving a perfect overall purity of 1 even though the clustering tells you nothing. In a case like that it is better to use other external metrics such as precision, recall, and F-measure; I would suggest looking into those. And to reiterate, this kind of external evaluation is only possible when you have prior knowledge of the labels, which I believe is the case from your question.
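To see that caveat concretely, here is the k=10 degenerate case computed the same way (plain Python again; the 5/5 split of A and B labels is just an arbitrary illustration):

```python
from collections import Counter

# 10 points, each placed in its own cluster (k equals the number of points)
cluster_ids = list(range(10))
labels      = ['A'] * 5 + ['B'] * 5

members = {}
for c, y in zip(cluster_ids, labels):
    members.setdefault(c, []).append(y)
correct = sum(Counter(ys).most_common(1)[0][1] for ys in members.values())
print(correct / len(labels))   # 1.0 -- a "perfect" score for a clustering that says nothing
```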
To answer your second question: choosing the number of clusters K is the most difficult part of k-means when you have no prior knowledge of the data. There are techniques to mitigate the problems caused by the choice of initial centroids; probably the most common is an algorithm called k-means++, which I would suggest looking into for further info. Note that k-means++ improves the seeding of the centroids for a given K, rather than choosing K itself.
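As a rough sketch of what that looks like in practice, assuming scikit-learn (where k-means++ seeding is the default) and using make_blobs purely as placeholder data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data standing in for your real feature matrix
X, true_labels = make_blobs(n_samples=100, centers=2, random_state=0)

# init='k-means++' spreads out the initial centroids; n_init restarts the
# algorithm several times and keeps the best run
km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
cluster_ids = km.fit_predict(X)
```

You could then feed cluster_ids and your known labels into a purity computation like the one sketched earlier.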