
Confusion matrix to check results

Data Science Asked by user105599 on August 11, 2021

I am a new user on StackExchange and a new learner of data science. I am trying to understand how to evaluate the results I have collected, specifically a set of fake users extracted from a dataset by running some analysis.

Using a specific algorithm, I found the following users:

User_Alg

user1
user2
user3
user28
user76
user67

and I would like to estimate the accuracy of my algorithm by comparing it against the dataset that contains all the fake users, manually labelled:

User_Dat
user1
user5
user28
user76
user67
user2
user29

As you can see, some users in my extracted list (User_Alg) are not included in the manually labelled list of all fake users in the dataset (User_Dat).
I have thought of using a confusion matrix to check the accuracy, but I would like to ask people with more experience in statistics and machine learning whether this method is appropriate and what it would look like here, or whether you would recommend another approach.

Thanks for your attention and your time.

2 Answers

A confusion matrix is indeed a very useful way to analyze the results of your experiment. It provides the exact number (or percentage) of instances with true class X predicted as class Y for all the possible classes. As such it gives a detailed picture of what the system classifies correctly or not.

But a confusion matrix is a bit too detailed if one wants to summarize the performance of the classifier as a single value. A single value is especially useful when comparing two different classifiers, since there's no general way to compare two confusion matrices. That's why people often use evaluation measures; for binary classification, the most common ones are:

  • Accuracy, which is simply the number of correct predictions divided by the total number of instances.
  • F-score, which is the harmonic mean of precision and recall. Somewhat ironically, F-score gives a more accurate picture of performance than accuracy because it takes into account the different types of possible errors (see the sketch just after this list).
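
To make this concrete, here is a minimal sketch in Python with scikit-learn of how the confusion matrix, accuracy and F-score could be computed for the two lists from the question. It assumes the full list of users in the dataset is known, which is needed to count true negatives; the genuine users below (user10, user11, user12) are made-up placeholders.

```python
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

# Users flagged as fake by the algorithm (User_Alg) and users manually
# labelled as fake (User_Dat), taken from the question.
user_alg = {"user1", "user2", "user3", "user28", "user76", "user67"}
user_dat = {"user1", "user5", "user28", "user76", "user67", "user2", "user29"}

# Hypothetical full list of users in the dataset; the genuine users here are
# placeholders, but some such list is required to count true negatives.
all_users = sorted(user_alg | user_dat | {"user10", "user11", "user12"})

y_true = [1 if u in user_dat else 0 for u in all_users]  # 1 = manually labelled fake
y_pred = [1 if u in user_alg else 0 for u in all_users]  # 1 = flagged by the algorithm

print(confusion_matrix(y_true, y_pred))  # rows = true class, columns = predicted class
print("accuracy:", accuracy_score(y_true, y_pred))
print("F-score :", f1_score(y_true, y_pred))
```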

Correct answer by Erwan on August 11, 2021

A confusion matrix is a great way to score a classifier. There are some additional metrics that are simply summary statistics derived from the confusion matrix. Some of these are:

  • Accuracy - What percent of your predictions are correct? The number of true positives plus true negatives divided by the total number of data points (in your case, users): (TP + TN) / total predictions.
  • Precision - What percent of the predicted positives are correct? True positives divided by the total number of predicted positives: TP / (TP + FP).
  • Recall - What percent of the actual positives did you catch? True positives divided by the total number of positives in the population: TP / (TP + FN).
  • True Negative Rate - What percent of the actual negatives did you correctly identify? TN / (TN + FP). (A sketch computing all four from a confusion matrix follows this list.)
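
As an illustration, here is a small Python sketch showing how these four quantities can be read off a confusion matrix. The labels below are made-up, with 1 meaning fake and 0 meaning genuine.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels: 1 = fake user, 0 = genuine user.
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)   # also called sensitivity or true positive rate
tnr       = tn / (tn + fp)   # true negative rate, also called specificity

print(accuracy, precision, recall, tnr)
```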

In general, accuracy isn't used very often because it can be very misleading for skewed class frequencies. Data science practice generally focuses on the other metrics I mentioned. There is almost always a tradeoff between precision and recall, and understanding the use case lets you weigh that tradeoff. For example, a cancer diagnostic blood test often favors recall over precision so that it doesn't miss any true positives; a follow-up test (e.g. an MRI) can then help distinguish the true positives from the false positives, and is likely biased towards precision so that no patients undergo unnecessary surgeries. To better understand the tradeoff, a ROC curve is sometimes generated (a plot of recall against the false positive rate). This Wikipedia page is a great starting point: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
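
Here is a short sketch of how a ROC curve could be generated with scikit-learn. Note that it needs a continuous score per user (e.g. a "fakeness" probability), not hard 0/1 predictions, so the scores below are hypothetical.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground truth (1 = fake) and hypothetical scores from the algorithm.
y_true  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.95, 0.80, 0.70, 0.40, 0.60, 0.30, 0.20, 0.20, 0.10, 0.05]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # tpr is the recall at each threshold
print("AUC:", roc_auc_score(y_true, y_score))

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (recall)")
plt.show()
```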

Answered by Timothy Chan on August 11, 2021
