TransWikia.com

Precision and recall for clustering?

Cross Validated Asked by Learner on September 20, 2020

Im confused about calculating precision and recall for clustering mentioned in this paper, Model-based Overlapping Clustering, A Banerjee et al., (last paragraph of column 1 on page 5).

Suppose, if we are given with resultant cluster indicator matrix $C$ of size $n times k$ where $n$ is the number of data points and $k$ is the number of cluster. A data point may be assigned to multiple cluster. Can any one provide me the exact details of calculating precision and recall for this resultant cluster indicator matrix $C$ as mentioned in the paper. Assume the true cluster indicator matrix is also available, say $True_C$

3 Answers

There's a wikipedia page on precision and recall that is quite clear on the definitions.

In the paper to which you refer, they say:

To evaluate the clustering results, precision, recall, and F-measure were calculated over pairs of points. For each pair of points that share at least one cluster in the overlapping clustering results, these measures try to estimate whether the prediction of this pair as being in the same cluster was correct with respect to the underlying true categories in the data. Precision is calculated as the fraction of pairs correctly put in the same cluster, recall is the fraction of actual pairs that were identified, and F-measure is the harmonic mean of precision and recall.

The only thing that is potentially tricky is that a given point may appear in multiple clusters. The authors appear to look at all pairs of points, say (x,y), and ask whether one of the clusters that contains point x also contains point y. A true positive (tp) is when this is both the case in truth and for the inferred clusters. A false positive (fp) would be when this is not the case in truth but is the case for the inferred clusters. A false negative (fn) would be when this is the case in truth but not for the inferred clusters.

Then precision = tp / (tp + fp) and recall = tp / (tp + fn).

Correct answer by Karl on September 20, 2020

  1. True Positive (TP) Assignment: when similar members are assigned to the same community. This is a correct decision.
  2. True Negative (TN) Assignment: when dissimilar members are assigned to dierent communities. This is a correct decision.
  3. False Negative (FN) Assignment: when similar members are assigned to dierent communities. This is an incorrect decision.
  4. False Positive (FP) Assignment: when dissimilar members are assigned to the same community. This is an incorrect decision.

enter image description here

enter image description here

  • Reference Social Data Mining. by Reza Zafarani Mohammad Ali Abbasi Huan Liu

Answered by Anshuman Kirty on September 20, 2020

If you want to use precision and recall for clustering, which isn't always used to evaluate the clustering, here is a useful link with a good example of how to find them for clustering: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

I will work through the relevant part of the linked example for finding TP, TN, FP, FN; adding details where necessary. As stated in the linked site:

A true positive (TP) decision assigns two similar documents to the same cluster, a true negative (TN) decision assigns two dissimilar documents to different clusters. There are two types of errors we can commit. A (FP) decision assigns two dissimilar documents to the same cluster. A (FN) decision assigns two similar documents to different clusters.

Consider the following example, enter image description here

Here we have three actual groups: x's, o's, and diamonds which we have tried to cluster into cluster 1, cluster 2, and cluster 3. Some mistakes were made. For example, cluster 2 has an x, four o's, and a diamond included in it. Now to quantify the TP's, FP's, TN's, FN's.

We will consider all pairs of documents, of which there are $N(N-1)/2=136$, since we have $N=17$ documents.

Now for TP+FP (all positives), we need to find out all pairs of x's, o's, and diamonds (not necessarily matching types) that exist in the same cluster. $6 choose 2$ pairs of x's in cluster 1, etc. This gives us

$TP+FP = {6 choose 2} + {6 choose 2} + {5 choose 2} = 40$ total positives

True positives are only the pairs that are of same type. For example, pairs of x's in cluster 1 is $5 choose 2$. This gives us,

$TP = {5 choose 2}+{4 choose 2}+{3 choose 2}+{2 choose 2}=20$

This leaves $40-20=20$ FP's.

Now for the total number of negatives, which is not in the link I provided. The total negatives plus positives must equal N, and thus $N-totalPostives=totalNegatives$. So there are $136 - 40 = 96$ negatives in total.

The number of FN's can be found by looking at pairs that should be grouped together, but are not. I will do the x's first. Cluster 1 has 5 x's each paired to three mismatches (3*5=15) plus cluster 2 has 1 x that is paired to two mismatched x's in cluster three that have not been accounted for (2*1=2). The o's are the same. Cluster 1 has one o, which is paired to 4 mismatched o's (1*4=4) in cluster 2. Now for the diamonds. Cluster 2 has one diamond, which is paired to 3 mismatched diamonds in cluster 3 (1*3). Add them up,

$FN = 3*5+2*1+1*4+1*3=24$

Since total negatives is 96, there must be 96-24=72 TN's.

The final confusion matrix is (as per the website):

enter image description here

And as Karl said precision and recall are:

enter image description here

Answered by Underminer on September 20, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP