Evaluating Binary Classification Models

A Summary of Key Metrics

1 Intro

A binary classification model predicts whether an instance belongs to one of two classes (e.g., positive vs. negative, spam vs. not spam). Evaluating the performance of such models is crucial to understand their effectiveness and make informed decisions about their deployment.

Precision, recall, specificity, and several related metrics are commonly used to evaluate classification models. All of them derive from the confusion matrix, which summarizes the outcomes of a binary classifier's predictions:

                 | Predicted Positive                 | Predicted Negative
Actual Positive  | True Positive (TP), "hit"          | False Negative (FN), "miss"
Actual Negative  | False Positive (FP), "false alarm" | True Negative (TN), "correct rejection"
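The four cells can be tallied directly from a pair of label vectors. A minimal sketch with made-up labels (1 = positive, 0 = negative):

```python
# Toy label vectors, chosen only for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Count each confusion-matrix cell by comparing actual vs. predicted labels.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

print(tp, fn, fp, tn)  # 3 1 1 3
```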

2 Precision

Precision is also known as Positive Predictive Value.

$$ \text{Precision} = \frac{TP}{TP + FP} $$

Precision answers the question: When the model predicts positive, how often is it correct?
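A quick numeric illustration, using made-up counts (30 true positives, 10 false positives):

```python
# Hypothetical counts for illustration only.
tp, fp = 30, 10
precision = tp / (tp + fp)
print(precision)  # 0.75
```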

3 Recall

Recall is also known as Sensitivity or True Positive Rate.

$$ \text{Recall} = \frac{TP}{TP + FN} $$

Recall answers the question: Out of all actual positives, how many did the model correctly identify?
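With made-up counts (8 true positives, 2 false negatives), recall works out as:

```python
# Hypothetical counts for illustration only.
tp, fn = 8, 2
recall = tp / (tp + fn)
print(recall)  # 0.8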

4 Specificity

Specificity is also known as True Negative Rate.

$$ \text{Specificity} = \frac{TN}{TN + FP} $$

Specificity answers the question: Out of all actual negatives, how many did the model correctly classify as negative?
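With made-up counts (90 true negatives, 10 false positives):

```python
# Hypothetical counts for illustration only.
tn, fp = 90, 10
specificity = tn / (tn + fp)
print(specificity)  # 0.9
```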

5 F1 Score

F1 Score is the harmonic mean of precision and recall. It is a single metric that combines both precision and recall into one number.

$$ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
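Plugging in illustrative values (precision 0.75, recall 0.6) shows how the harmonic mean pulls toward the smaller of the two:

```python
# Hypothetical precision/recall values for illustration only.
precision, recall = 0.75, 0.6
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.667
```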

6 Accuracy

Accuracy is the most straightforward metric, defined as the ratio of correctly predicted instances to the total instances.

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
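With made-up counts covering all four cells:

```python
# Hypothetical counts for illustration only.
tp, tn, fp, fn = 50, 35, 10, 5
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85
```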

7 ROC Curve & AUC

Most binary classification models output a probability score for the positive class. To convert this score into a binary prediction, we choose a threshold (e.g., 0.5). A probability score above the threshold is classified as positive, while a score below the threshold is classified as negative.
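Thresholding is a one-line operation. A sketch with made-up scores (here a score exactly at the threshold counts as positive, which is one common convention; the source leaves the boundary case unspecified):

```python
# Hypothetical probability scores for illustration only.
scores = [0.9, 0.4, 0.65, 0.2]
threshold = 0.5
preds = [1 if s >= threshold else 0 for s in scores]
print(preds)  # [1, 0, 1, 0]
```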

The choice of threshold affects the model’s performance in terms of the metrics discussed so far. To evaluate the model’s performance across all possible thresholds, we use the ROC curve and AUC.

The ROC (Receiver Operating Characteristic) curve is a graphical representation of a model's performance across different thresholds. It plots the True Positive Rate (recall) against the False Positive Rate (1 - specificity) at each classification threshold.

[Figure: ROC curve, plotting True Positive Rate against False Positive Rate across thresholds.]

AUC (Area Under the Curve) is the area under the ROC curve. It summarizes model performance across all thresholds: a higher AUC indicates a better classifier. AUC is often used when you want to evaluate a model without committing to a specific threshold.
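The whole procedure can be sketched in plain Python: sweep thresholds from high to low, record (FPR, TPR) at each one, and integrate with the trapezoidal rule. The labels and scores below are made up for illustration:

```python
# Toy data: 1 = actual positive, 0 = actual negative, with model scores.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

P = sum(y_true)            # number of actual positives
N = len(y_true) - P        # number of actual negatives

# Sweep thresholds from high to low; classify score >= t as positive.
points = [(0.0, 0.0)]      # the ROC curve starts at the origin
for t in sorted(set(scores), reverse=True):
    preds = [1 if s >= t else 0 for s in scores]
    tp = sum(p and y for p, y in zip(preds, y_true))
    fp = sum(p and not y for p, y in zip(preds, y_true))
    points.append((fp / N, tp / P))  # (FPR, TPR)

# AUC via the trapezoidal rule over consecutive ROC points.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(auc)  # 0.75
```

Note that this O(n^2) sweep is only for clarity; production libraries sort the scores once and trace the curve in a single pass.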

8 Which Metric to Use?

The choice of metric depends on the specific problem and the consequences of false positives and false negatives. Here’s a summary:

Scenario                                                                  Key Metric
False positives are costly (e.g., spam filters, fraud detection)          Precision
False negatives are costly (e.g., medical diagnosis, security threats)    Recall
Need a balance between precision and recall                               F1 Score
Need to avoid unnecessary interventions (e.g., legal cases)               Specificity
Balanced dataset, overall correctness matters                             Accuracy
Need threshold-independent evaluation                                     AUC-ROC

Prepared by Jay at MDAL with litedown and ChatGPT.