Evaluation Metrics

Masa Abdalhalim
CodeX

--

Welcome to this article about evaluation metrics. I assume you are here because you ran into this concept while learning about classification models and you are looking for some extra explanation or a quick revision.

In any Machine Learning problem, once we get a result, we want to measure how accurate our model is. In regression problems, accuracy is generally measured in terms of the difference between the actual values and the predicted values. For that, we use metrics like the R-squared score, adjusted R-squared, Mean Squared Error, etc.

But… what about classification problems? How can we tell whether our algorithms are getting better and how they are doing overall? Which metrics measure the credibility of a classification model?

In a classification problem, the credibility of the model is measured using the confusion matrix generated, i.e., how accurately the true positives and true negatives were predicted. The different metrics used for this purpose are:

  • Accuracy
  • Recall
  • Precision
  • F1 Score
  • Specificity
  • ROC (Receiver Operating Characteristic)
  • AUC (Area Under the Curve)

Confusion Matrix

In order to understand the above metrics, we have to recall the confusion matrix. A typical confusion matrix looks like this:

The confusion matrix can be confusing for many beginners, hence its name. But here is all it tells us:

  • True Positive (TP): When the result is predicted as positive while it truly is positive.
  • True Negative (TN): When the result is predicted as negative while it truly is negative.
  • False Positive (FP): When the result is predicted as positive while it is actually negative (…the model was confused).
  • False Negative (FN): When the result is predicted as negative while it is actually positive (…the model was confused again).

Example:

Suppose we have the following confusion matrix for a COVID-19 PCR test:

We say:

  • True Positive (TP): for someone who has covid-19 and has tested positive.
  • True Negative (TN): for someone who doesn’t have covid-19 and has tested negative.
  • False Positive (FP): for someone who doesn’t have covid-19 but has tested positive.
  • False Negative (FN): for someone who has covid-19 but has tested negative.
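
To make these four counts concrete in code, here is a minimal sketch using scikit-learn’s confusion_matrix (assuming scikit-learn is available; the labels below are made up purely for illustration):

```python
# Minimal sketch: reading TP, TN, FP, FN off a binary confusion matrix.
# Assumes scikit-learn is installed; the labels are made up for illustration.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = has covid-19, 0 = does not
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # what the test / model predicted

# For binary 0/1 labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```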

Accuracy

It is the number of data points (items) correctly identified as belonging to their class, divided by the total number of items. It’s described by the formula below:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

In our previous example of covid-19:

TP = 20, TN = 40, FP = 0, FN = 40
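
Plugging these counts into the formula gives Accuracy = (20 + 40) / (20 + 40 + 0 + 40) = 60 / 100 = 0.6, i.e. 60%. A minimal sketch of the same calculation in Python:

```python
# Accuracy = (TP + TN) / (TP + TN + FP + FN), with the counts from the covid-19 example.
TP, TN, FP, FN = 20, 40, 0, 40

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.6 -> 60% of all predictions were correct
```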

Accuracy is a fine metric, but it has some shortcomings: it is only a good choice for classification problems that are well balanced and not skewed, i.e., with no class imbalance.

Recall or Sensitivity

It tells us how many of the actual positives were recalled (correctly identified) from the dataset:

Recall = TP / (TP + FN)

In our previous example of covid-19:
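
With TP = 20 and FN = 40, Recall = 20 / (20 + 40) ≈ 0.33, so only about a third of the infected people were actually detected. A minimal sketch of the calculation:

```python
# Recall (sensitivity) = TP / (TP + FN), with the counts from the covid-19 example.
TP, FN = 20, 40

recall = TP / (TP + FN)
print(round(recall, 2))  # 0.33 -> only a third of the infected people were detected
```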

Precision

As the name suggests, it measures how precise the model is: how many of the items labeled as positive are actually positive.

Precision = TP / (TP + FP)

For instance, in our covid-19 test example, precision is the rate at which a person our algorithm predicts as covid-19 infected really has covid-19.
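
With TP = 20 and FP = 0, Precision = 20 / (20 + 0) = 1.0, so every person the test flagged as infected really was infected. A minimal sketch of the calculation:

```python
# Precision = TP / (TP + FP), with the counts from the covid-19 example.
TP, FP = 20, 0

precision = TP / (TP + FP)
print(precision)  # 1.0 -> every positive prediction was actually infected
```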

F1 Score

From the previous examples, it is clear that we need a metric that considers both Precision and Recall for evaluating a model. One such metric is the F1 score.

F1 score is defined as the harmonic mean of Precision and Recall.

The mathematical formula is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
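
With the precision of 1.0 and recall of 1/3 from the covid-19 example, F1 = 2 × (1.0 × 1/3) / (1.0 + 1/3) = 0.5. A minimal sketch of the calculation:

```python
# F1 = 2 * (precision * recall) / (precision + recall), using the covid-19 example values.
precision, recall = 1.0, 20 / 60

f1 = 2 * (precision * recall) / (precision + recall)
print(f1)  # 0.5 -> dragged down by the poor recall despite the perfect precision
```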

Specificity or True Negative Rate

This represents how specific the model is when predicting the true negatives. Mathematically,

Specificity = TN / (TN + FP)
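
With TN = 40 and FP = 0 in the covid-19 example, Specificity = 40 / (40 + 0) = 1.0. A minimal sketch of the calculation:

```python
# Specificity (true negative rate) = TN / (TN + FP), with the counts from the covid-19 example.
TN, FP = 40, 0

specificity = TN / (TN + FP)
print(specificity)  # 1.0 -> every healthy person was correctly labeled negative
```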

ROC (Receiver Operating Characteristic)

It represents the various confusion matrices obtained at various classification thresholds; each black dot on the curve is one confusion matrix. The ROC curve answers the question of which threshold to choose for a classification problem.

  • The green dotted line represents the case when the true positive rate equals the false positive rate.
  • As we move from the rightmost dot towards the left, after a certain threshold, the false positive rate decreases.
  • After some point, the false positive rate becomes zero.
  • The point encircled in green is the best point, as it predicts all the values correctly while keeping the false positives at a minimum.
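
As a rough illustration of how such a curve is produced, here is a minimal sketch using scikit-learn’s roc_curve and matplotlib (assuming both are installed; the labels and scores below are hypothetical placeholders):

```python
# Minimal sketch: computing and plotting a ROC curve from predicted scores.
# Assumes scikit-learn and matplotlib are installed; the data is hypothetical.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                     # true labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5]  # predicted probabilities

# One (FPR, TPR) point per threshold -- each point corresponds to one confusion matrix.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

plt.plot(fpr, tpr, marker="o")
plt.plot([0, 1], [0, 1], linestyle="--")  # diagonal: TPR equals FPR (random guessing)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()
```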

AUC (Area Under the Curve)

AUC helps in choosing the right model among several models for which we have plotted the ROC curves: the best model is the one with the maximum area under its curve.

In this diagram, amongst the two curves, the model that produced the red one should be chosen, as it clearly covers more area than the blue one.
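
As a minimal sketch of that comparison in code, scikit-learn’s roc_auc_score can compute the area under each model’s curve directly (the labels and scores below are hypothetical):

```python
# Minimal sketch: comparing two models by their AUC. The data is hypothetical.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores_model_a = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5]
scores_model_b = [0.3, 0.2, 0.6, 0.7, 0.4, 0.8, 0.9, 0.1]

# The model with the larger AUC separates the classes better overall.
print(roc_auc_score(y_true, scores_model_a))
print(roc_auc_score(y_true, scores_model_b))
```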

Picking the right metric?

Depending on what you want your Machine Learning algorithm to do, in many cases you care about specific outcomes more than others. For example, in our covid-19 test, we care more about actually detecting an infected person, even if that means occasionally tolerating some false positives, than about letting someone who has covid-19 think they don’t, which could be disastrous for their own health and for the people around them.

That’s a case where your evaluation metric should weight one type of error differently from another. So it really depends on the business requirement.

As is evident from the previous example, where we are predicting covid-19, the model had very high precision but performed poorly in terms of recall. What we really need there is 100% recall.

But suppose we are predicting whether a person is innocent or not; then we need 100% precision.

By choosing the right metric, you can tune your algorithm to optimize for exactly what you want it to do :)
