How to Evaluate a Model

Recall, Specificity, Precision, F1 Scores, and Accuracy

Namrata Kapoor
The Startup



There is an old saying by Jim Rohn: "Accuracy builds credibility."

However, accuracy in machine learning means something quite specific, and we often need other methods to validate a model.

When we develop a classification model, we need to measure how well it predicts. To evaluate it, we first need to define what we mean by good predictions.

There are several metrics that evaluate how accurately a model predicts each class, and analyzing them helps us improve it.

Let us look at a confusion matrix, which counts the correctly predicted happy and sad cases as well as the wrongly predicted ones.

                  Predicted Happy    Predicted Sad
Actual Happy      True Positive      False Negative
Actual Sad        False Positive     True Negative
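As a minimal sketch of how such a matrix can be computed (assuming Python with scikit-learn, and made-up happy/sad labels):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted labels for a happy/sad classifier
y_true = ["happy", "happy", "happy", "sad", "sad", "happy", "sad", "sad"]
y_pred = ["happy", "sad",   "happy", "sad", "happy", "happy", "sad", "sad"]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=["happy", "sad"])
print(cm)
# [[3 1]   -> 3 happy predicted as happy, 1 happy predicted as sad
#  [1 3]]  -> 1 sad predicted as happy, 3 sad predicted as sad
```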

Let's evaluate different evaluation techniques to understand which one to use in a given situation.

Recall, Sensitivity, or True Positive Rate

Recall, also known as the True Positive Rate, is the ratio of True Positives to the sum of True Positives and False Negatives.

[Diagram: recall focuses on the Actual Positives in the confusion matrix]

It measures how many of the actual positive (happy) cases were correctly predicted as positive. It is important when the cost of a False Negative is high.

For example, if we want to predict fraud or a disease.

Suppose we have to predict a highly contagious disease like COVID in a patient.

If an infected patient (an actual positive) goes through the test and is predicted as not sick (a False Negative), the cost is very high and dangerous, as he/she may infect many others.

A similar case is fraud detection: if an actual fraud is predicted as not a fraud (a False Negative), the impact can be severe, for instance in a bank.

The formula for it is as under:

Recall = TP / (TP + FN)

When False Negatives are zero, the value of sensitivity is 1, which is the optimal value.

An easy way to remember the formula is that recall focuses on the Actual Positives, as in the diagram above.
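As a quick sketch (reusing the same toy happy/sad labels, scikit-learn assumed), recall can be computed directly:

```python
from sklearn.metrics import recall_score

# Same toy labels as above; "happy" is treated as the positive class
y_true = ["happy", "happy", "happy", "sad", "sad", "happy", "sad", "sad"]
y_pred = ["happy", "sad",   "happy", "sad", "happy", "happy", "sad", "sad"]

# TP = 3, FN = 1, so recall = 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred, pos_label="happy"))  # 0.75
```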

Specificity

Specificity, or the True Negative Rate, is the ratio of True Negatives to the sum of True Negatives and False Positives.

[Diagram: specificity focuses on the Actual Negatives in the confusion matrix]

It is favorable to measure a model's specificity when False Positives are highly costly.

For example, a test that correctly labels all healthy people as negative for a particular illness is very specific. A highly specific test will correctly rule out people who don't have the disease and will generate few false-positive results.

If a test wrongly identifies 20% of healthy people as having the condition, it is not specific and will have a high False Positive rate.

The formula is as under:

Specificity = TN / (TN + FP)

When False Positives are zero, the Specificity is 1, which indicates a highly specific model.

An easy way to remember the formula is that specificity focuses on the Actual Negatives, as in the diagram above.
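scikit-learn does not ship a dedicated specificity function, so a common approach is to derive it from the confusion matrix; a minimal sketch with the same toy labels:

```python
from sklearn.metrics import confusion_matrix

# Same toy labels; "sad" is the negative class for the "happy" classifier
y_true = ["happy", "happy", "happy", "sad", "sad", "happy", "sad", "sad"]
y_pred = ["happy", "sad",   "happy", "sad", "happy", "happy", "sad", "sad"]

tn, fp, fn, tp = confusion_matrix(
    y_true, y_pred, labels=["sad", "happy"]  # negative class listed first
).ravel()

# Specificity = TN / (TN + FP) = 3 / (3 + 1) = 0.75
specificity = tn / (tn + fp)
print(specificity)  # 0.75
```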

Precision

Precision, or the Positive Predictive Value, is the ratio of True Positives to the sum of True Positives and False Positives.

[Diagram: precision focuses on the Predicted Positives in the confusion matrix]

It is especially useful when the cost of False Positives is high.

A good example where high precision matters is email spam (vs. ham) classification.

In a spam/ham classifier, if a few relevant emails are labeled as spam, they are False Positives: predicted positive but actually negative.

In this case, the user may lose important information in those emails. Such a model has low precision and is not a good spam detector.

The formula is as under:

Precision = TP / (TP + FP)

When False Positives are zero, the Precision is 1, which indicates a high-precision model.

An easy way to remember the formula is that precision focuses on the Predicted Positives, as in the diagram above.
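A minimal sketch of the spam example (hypothetical labels, scikit-learn assumed):

```python
from sklearn.metrics import precision_score

# Hypothetical spam/ham labels; "spam" is the positive class
y_true = ["spam", "ham", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "spam", "ham", "ham", "ham"]

# TP = 2 (correctly flagged spam), FP = 1 (a relevant email flagged as spam)
# Precision = 2 / (2 + 1) ≈ 0.67
print(precision_score(y_true, y_pred, pos_label="spam"))
```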

Accuracy

Accuracy is the fraction of correctly predicted samples out of all samples.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The accuracy of a model is the percentage of samples assigned to their correct classes.

However, it is not always a good measure for validating a model, because it depends on the data and its balance of classes. If a particular class is a small minority, a model can reach 99% accuracy simply by predicting the majority class every time, and we still can't say it is performing well.

Accuracy is a better metric when there is no class imbalance, although perfectly balanced data is rare in real-life situations.
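A minimal sketch (made-up imbalanced labels, scikit-learn assumed) of how accuracy can look excellent while the model is useless:

```python
from sklearn.metrics import accuracy_score

# A made-up imbalanced set: 99 negatives, 1 positive
y_true = [0] * 99 + [1]
# A useless model that always predicts the majority class
y_pred = [0] * 100

# 99% accuracy despite never detecting the positive class
print(accuracy_score(y_true, y_pred))  # 0.99
```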

F1 Score

The F1 Score is the harmonic mean of Precision and Recall, and it captures incorrectly classified cases better than the Accuracy metric.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

It is a better measure when we need to balance Precision and Recall, or when there is a class imbalance (a large number of Actual Negatives and far fewer Actual Positives). Since class imbalance is common in real-life situations, it is often better to use the F1 Score over accuracy.


In the F1 Score, the harmonic mean penalizes extreme values.

If the False Negative or False Positive counts are non-zero, the F1 Score drops; if both are zero, the model is perfect, with maximal precision and sensitivity.
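A minimal, hedged sketch (toy numbers, scikit-learn assumed) of how the harmonic mean pulls the F1 Score toward the weaker of precision and recall:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Same imbalanced setup as above, but with a model that finds the positive
y_true = [0] * 99 + [1]
y_pred = [0] * 97 + [1, 1, 1]  # 2 false positives, 1 true positive

p = precision_score(y_true, y_pred)  # 1 / (1 + 2) ≈ 0.33
r = recall_score(y_true, y_pred)     # 1 / (1 + 0) = 1.0
print(f1_score(y_true, y_pred))      # 2 * p * r / (p + r) = 0.5
```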

Conclusion

All these metrics are important to learn and to use as the situation demands.

They may seem confusing at first, but once you are familiar with them, they are a great help in analyzing and rating a model.

Which metric to use always depends on the situation you are in and what priority needs to be given to False Negatives versus False Positives when selecting a model.

I hope that after reading this you will be more familiar with these situations and a better judge of which model-validation method to use.

Thanks for reading!

Originally published at https://www.numpyninja.com on January 15, 2021.
