Metrics to evaluate classification models
Machine learning classification models assign each point in a dataset to a class. Once this categorization is performed, how can we evaluate how good the classification is? Which metrics work best for each type of dataset? Let's see which metrics are available, what each of them means, and how we can better analyze model performance.
Classification algorithms
Supervised machine learning algorithms can be divided into Regression and Classification algorithms. Roughly speaking, the former predict numerical values such as house prices and scores, while the latter predict categorical values such as yes or no (binary); spam or not spam (binary); or mammal, fish, reptile, amphibian, or bird (multiclass). Therefore, when we want to assign categories to our data, we must rely on Classification algorithms. These algorithms learn from a dataset where the classes are already assigned and then classify new observations.
Classification algorithms can be divided into two types: lazy learners and eager learners. Lazy learners store the training dataset and only do the real work when a prediction is requested, that is, when test data is provided to be classified. They will, therefore, spend more time classifying than learning. One example is the K-NN algorithm. Eager learners, on the other hand, build a classification model from the training dataset before any test data is supplied. These algorithms consequently spend more time learning than classifying. Examples include decision trees and Naive Bayes.
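As a quick illustration of the two styles, here is a minimal sketch, assuming scikit-learn is available; the dataset and hyperparameters are arbitrary and only meant to show where each algorithm does its work.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier   # lazy learner
from sklearn.tree import DecisionTreeClassifier      # eager learner

# Toy binary-classification dataset (values are arbitrary, for illustration only)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# K-NN essentially stores the training set and defers the real work to prediction time
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# The decision tree builds its model up front, so prediction is cheap
tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)

print(knn.predict(X_test[:5]), tree.predict(X_test[:5]))
```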
Evaluation
Once a classification algorithm is chosen and new observations are categorized, we must analyze classification performance; in other words, we must define a way to know how good the algorithm is at predicting the correct categories. For this, we have what are called evaluation metrics. There are many of them; let's understand the most widely used ones:
- Accuracy: a metric that measures how many correct predictions the model made. Consider the table in Fig. 2 (also called the confusion matrix). The correct predictions are in the green and purple blocks, and the incorrect predictions are in the blue blocks. The model correctly predicted 500 True Positives (TP) + 300 True Negatives (TN), totaling 800 correct predictions out of 500 + 125 + 145 + 300 = 1070 predictions, which means the model was right in about 74.77% of its predictions. This is the accuracy of the model. In Fig. 2, the False Positives (FP) and the False Negatives (FN) are represented by the blue blocks and are, respectively, the Positive class predicted when the actual class is Negative, and the Negative class predicted when the actual class is Positive.
The mathematical formula for the accuracy, as you may have already imagined, is given by:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
which means we sum the TP (green block) with the TN (purple block) predicted by the model and divide by the total number of predictions (TP + TN + FP + FN — green, purple, and blue blocks). The higher the accuracy, the better the model* (* some critical observations about this statement will be discussed in the next section).
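To make the arithmetic concrete, here is a small sketch, assuming scikit-learn: it rebuilds label arrays that reproduce the counts of Fig. 2 and checks the accuracy both by hand and with accuracy_score.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Rebuild label arrays that reproduce the counts in Fig. 2 (1 = Positive, 0 = Negative)
y_true = [1] * 500 + [1] * 125 + [0] * 145 + [0] * 300   # actual classes
y_pred = [1] * 500 + [0] * 125 + [1] * 145 + [0] * 300   # predicted classes

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                      # 500 300 145 125
print((tp + tn) / (tp + tn + fp + fn))     # 0.7477...
print(accuracy_score(y_true, y_pred))      # same value
```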
- Recall: a measure of how many True Positives the model predicts over the total number of actual Positives. The mathematical formula is:

Recall = TP / (TP + FN)
In our table, Recall is given by the 500 TP over the 500 + 125 = 625 actual Positives, which gives a value of 0.8. Note that the best possible value of Recall is 1, reached when the model predicts every actual Positive correctly and there are no FN.
We can conclude that Recall is an excellent metric for choosing a model when we don't want it to produce many False Negatives. For example, consider predicting sick patients during an outbreak of a contagious disease. If a patient is predicted as Negative (doesn't have the disease) but is actually Positive (has the disease), this error can help spread the disease. Therefore, we want a model that captures as many of the actual Positives as possible, which is done by minimizing the number of False Negative predictions, thus increasing Recall.
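Using the same counts from Fig. 2, Recall can be checked by hand and with scikit-learn's recall_score (a sketch; the arrays below simply reproduce the table's counts).

```python
from sklearn.metrics import recall_score

# Same counts as Fig. 2: 500 TP and 125 FN among the 625 actual Positives
y_true = [1] * 625 + [0] * 445
y_pred = [1] * 500 + [0] * 125 + [1] * 145 + [0] * 300

print(500 / (500 + 125))              # 0.8, recall by hand
print(recall_score(y_true, y_pred))   # 0.8
```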
- Precision: measures how many True Positives the model predicts over the total number of predicted Positives. The mathematical formula is:

Precision = TP / (TP + FP)
In our table, Precision is given by the 500 TP over the 500 + 145 = 645 predicted Positives, which gives a value of ~0.78. Note that the best possible value of Precision is 1, reached when every Positive prediction is correct and there are no FP.
Similarly to Recall, we can conclude that Precision is an excellent metric to evaluate model performance when predicting too many False Positives has a high cost. The classic example is spam detection: if an email is not spam (Negative) but is predicted as spam (False Positive), the receiver may lose important emails. Therefore, we seek a model that predicts spam more reliably by minimizing the number of False Positives, thus increasing Precision.
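A small sketch of the spam scenario, assuming scikit-learn and entirely made-up labels, shows how a single False Positive pulls Precision down:

```python
from sklearn.metrics import precision_score

# Hypothetical spam-filter output: 1 = spam, 0 = not spam
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]        # 2 TP, 1 FN, 1 FP, 4 TN

print(2 / (2 + 1))                        # precision by hand: ~0.67
print(precision_score(y_true, y_pred))    # same value
```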
- F1 score: a measure defined as the harmonic mean of Precision and Recall, with the mathematical formula given by:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
Therefore, the F1 score is useful when seeking a balance between Precision and Recall, for example when FN and FP predictions come at similar costs.
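Sticking with the numbers from Fig. 2, a short sketch (assuming scikit-learn) confirms that f1_score matches the harmonic mean of the Precision and Recall computed above.

```python
from sklearn.metrics import f1_score

# Same label arrays reproducing the counts of Fig. 2
y_true = [1] * 625 + [0] * 445
y_pred = [1] * 500 + [0] * 125 + [1] * 145 + [0] * 300

p, r = 500 / 645, 500 / 625            # Precision and Recall from the table
print(2 * p * r / (p + r))             # harmonic mean, ~0.787
print(f1_score(y_true, y_pred))        # same value
```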
- AUC ROC: the ROC (Receiver Operating Characteristic) is a curve plotted with the True Positive Rate (a.k.a. Recall) on the y-axis and the False Positive Rate (FPR) on the x-axis, where the FPR is the number of False Positives over the total number of actual Negatives. The mathematical formula for FPR is:

FPR = FP / (FP + TN)
Several ROC curves are shown below. Each point of a curve corresponds to a different classification threshold, a value chosen depending on the relative costs of false positives and false negatives. The x = y curve (pink), for example, is the curve of a no-skill model: it produces TP and FP at the same rate, which is equivalent to random prediction. Above that curve the model is skilled, the best curve being the one where all Positives are predicted correctly; below that curve the model gets worse, down to the extreme where its predictions are completely inverted and everything is classified wrong.
The AUC is the Area Under the (ROC) Curve. It is the integral of the True Positive Rate with respect to the False Positive Rate over its entire range, and it is therefore an aggregate measure of the model's performance across all thresholds. As we can see from Fig. 8, the highest value of the AUC is 1 (the area of the square with sides of length 1), obtained for the best curve, when the model predicts all Positives correctly. An AUC of 0.5 (the area of the triangle with sides of length 1) corresponds to random performance, and a value of 0 corresponds to the worst curve (no area under that curve).
AUC is, therefore, an excellent metric to measure the quality of the model's predictions without worrying about choosing a threshold.
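The sketch below, assuming scikit-learn and made-up predicted probabilities, shows how the ROC curve and the AUC are computed from a model's scores rather than from hard class labels: roc_curve returns one (FPR, TPR) point per threshold, and roc_auc_score aggregates them into a single number.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical predicted probabilities of the Positive class
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
print(list(zip(thresholds, fpr, tpr)))
print(roc_auc_score(y_true, y_score))               # area under the curve: 0.875
```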
Balanced and imbalanced datasets
We now come back to the * in the Accuracy description. It was said: "The higher the accuracy, the better the model," but this statement is only valid if your dataset is balanced, meaning that the number of observations in each class is roughly the same, which is not always the case. If we consider, for example, spam detection or disease prediction, it is very likely we will find a highly imbalanced dataset, and in these cases accuracy is not a good metric. For example, consider this Heart Disease dataset, which has 319,795 rows representing the adults interviewed. Plotting a pie chart of the labels shows that only 8.6% of the respondents reported having heart disease. A model that simply predicts "no heart disease" for everyone would therefore reach an accuracy of about 91% while never identifying a single sick patient, so a high accuracy here does not mean you have a good model.
Therefore, when dealing with an imbalanced dataset, we must rely on metrics such as Recall, Precision, F1 score, or AUC ROC. In the case of the heart disease dataset, for example, we want the model to predict TP well and avoid FN, so that patients who do have heart disease can get adequate treatment. We could then choose Recall and AUC ROC to analyze model performance.
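To see how misleading accuracy can be on skewed data, here is a sketch (assuming scikit-learn; the labels are synthetic, loosely mimicking the ~9% positive rate above) that scores a baseline which predicts "no disease" for everyone.

```python
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with ~9% Positives, loosely mimicking the heart-disease split
y_true = [1] * 9 + [0] * 91

# A useless baseline that predicts Negative (no disease) for everyone
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))   # 0.91 -- looks great
print(recall_score(y_true, y_pred))     # 0.0  -- catches no sick patients
```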
Conclusions
There are many metrics available to analyze model performance. Each of them can be more or less suitable depending on whether the dataset is balanced or imbalanced and the costs of each prediction. Understanding each of the metrics available and knowing the best one to use is crucial to obtaining better models.