Sensitivity, Specificity and Predictive Values – What is the best way to measure the performance of binary classification models?

In a recent blog post we discussed using the predictions from a logistic regression model for binary classification. Classification means assigning an outcome to an individual or case, usually for the purpose of making a decision. Examples include predicting which individuals will default on personal debt in order to decide who could be offered a credit card, or predicting which visitors to a retail website will make a purchase in order to direct marketing to them. In this blog we will discuss some ways of measuring the performance of a classification model, using as an example an analysis we carried out for Mid Essex NHS Trust. In that analysis we built a regression model to predict the probability of success when extubating intensive care patients, and identified a threshold probability beyond which a clinician was encouraged to extubate their patient.

In order to choose a threshold probability to turn a probability model into a classification model we usually consider the quantities sensitivity and specificity. The sensitivity (otherwise known as the true positive rate) is the proportion of successful extubations that are correctly classified as such, while the specificity (otherwise known as the true negative rate) is the proportion of unsuccessful extubations that are correctly classified as such. We would like to maximise both these quantities, but there is often a trade-off: as we decrease the threshold probability we tend to increase the rate of true positives, but decrease the rate of true negatives. This trade-off can be visualised using a ROC curve, allowing us to pick an optimal threshold.
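
To make the trade-off concrete, here is a minimal Python sketch that applies three thresholds to the same set of predicted probabilities. The probabilities and outcomes are invented purely for illustration (they are not the extubation data), and the function name `sensitivity_specificity` is our own.

```python
def sensitivity_specificity(probabilities, outcomes, threshold):
    """Classify a case as "successful" when its predicted probability is at
    least `threshold`, then return (sensitivity, specificity)."""
    tp = fn = tn = fp = 0
    for p, success in zip(probabilities, outcomes):
        predicted_success = p >= threshold
        if success and predicted_success:
            tp += 1
        elif success:
            fn += 1
        elif predicted_success:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (fp + tn)


# Hypothetical predicted probabilities of success and the observed outcomes
# (True = successful extubation). Illustrative values only.
probs = [0.95, 0.90, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
outcomes = [True, True, True, True, False, True, False, True, False, False]

for threshold in (0.25, 0.50, 0.75):
    sens, spec = sensitivity_specificity(probs, outcomes, threshold)
    print(f"threshold {threshold:.2f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

With these made-up numbers, lowering the threshold raises the sensitivity and lowers the specificity, which is the pattern a ROC curve traces out across all possible thresholds.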

After choosing a threshold probability, we have a classification model that predicts whether the extubation will be successful or not. We know that the model is not perfect and so a natural question to ask is: given the model’s classification, what is the probability that the extubation will be successful? This question is subtly different to that addressed by the sensitivity and specificity. Sensitivity and specificity condition on the true outcome, that is, they answer the question: given the true outcome, what is the probability that the model got the classification correct? But when, for example, clinicians are considering the extubation of new patients, we won’t know the true outcome until after the event. Instead, we’re likely to be more interested in the question: given that the model says it’s OK to extubate, what is the probability that this is the right decision? This is called the positive predictive value (PPV), while the probability of unsuccessful extubation given an “unsuccessful” classification is called the negative predictive value (NPV). Thus, rather than the true outcome, these predictive values condition on the model’s decision.

It turns out the positive predictive value, the negative predictive value, the sensitivity and the specificity are all tied together, and can all be calculated from a 2×2 table showing the observed counts of all combinations of classification and outcomes:

| | Extubation successful | Extubation unsuccessful |
| --- | --- | --- |
| Classification “successful” | True positives (TP) | False positives (FP) |
| Classification “unsuccessful” | False negatives (FN) | True negatives (TN) |

The formulae for the various quantities are as follows:

Sensitivity = TP / (TP + FN)

Specificity = TN / (FP + TN)

PPV = TP / (TP + FP)

NPV = TN / (FN + TN)
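
As a quick check on these formulae, the following Python sketch computes all four quantities from the cells of the 2×2 table. The counts used here are hypothetical, since the underlying counts from the extubation study are not reproduced in this post, and the function name `performance_measures` is our own.

```python
def performance_measures(tp, fp, fn, tn):
    """Return the four measures computed from the cells of the 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # correct among truly successful cases
        "specificity": tn / (fp + tn),  # correct among truly unsuccessful cases
        "ppv": tp / (tp + fp),          # successful among those classified "successful"
        "npv": tn / (fn + tn),          # unsuccessful among those classified "unsuccessful"
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
measures = performance_measures(tp=90, fp=5, fn=10, tn=45)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```

Note that sensitivity and specificity each use a single column of the table, while PPV and NPV each use a single row.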

Looking again at the model for the extubation study, we obtain the following four performance values:

Sensitivity = 98.3%

Specificity = 88.2%

PPV = 96.7%

NPV = 93.6%

The question is, which measures are most useful? One disadvantage of PPV and NPV is that they depend on the overall success rate in the population. For example, if extubations are usually successful then it is easy to achieve a high PPV by simply classifying every case as successful. On the other hand, if successful extubations are very rare then it is hard for even a good model to achieve a high PPV. Sensitivity and specificity do not have this problem – they do not depend on the overall success rate and as such they may be interpreted in absolute terms as a measure of the classifier’s performance.
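
The dependence on the overall success rate can be illustrated with Bayes' theorem, which expresses PPV and NPV in terms of sensitivity, specificity and the success rate. The sketch below holds the sensitivity and specificity fixed at the values reported above and varies the success rate. The three success rates are hypothetical values chosen for illustration, not figures from the study, and the function name `predictive_values` is our own.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) via Bayes' theorem, where `prevalence` is the
    overall proportion of successful outcomes in the population.

    Assumes the classifier produces both classifications at least some of
    the time, so that neither denominator is zero.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Sensitivity and specificity as reported for the extubation model.
SENSITIVITY = 0.983
SPECIFICITY = 0.882

# Hypothetical overall success rates, for illustration only.
for prevalence in (0.05, 0.50, 0.90):
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"success rate {prevalence:.0%}: PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

With the classifier held fixed, the PPV rises and the NPV falls as successful outcomes become more common, which is why the predictive values cannot be carried over unchanged to a population with a different success rate.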

Despite the aforementioned disadvantage, PPV and NPV can be useful measures when it comes to making individual diagnoses, since they indicate the probabilities of the outcomes given the classification. For example, the high PPV in our example means that if the classification is “successful” then the extubation can be performed with little concern as there is a high probability of success.

Thus, we have two distinct pairs of performance measures that can be extracted from a table of classification and outcome counts. Each pair addresses a different question: sensitivity and specificity tell us about the distribution of classifications given the true outcome, while PPV and NPV give the probabilities of the outcomes given the classification. All four measures can be useful so it is wise to check them all before implementing the classifier as a decision tool. You can read more about putting these measures into practice in our case study where we were tasked with assessing the accuracy of the Dyslexia QuickScreen Test.