A binary classification model predicts whether an instance belongs to one of two classes (e.g., positive vs. negative, spam vs. not spam). Evaluating the performance of such models is crucial to understand their effectiveness and make informed decisions about their deployment.
The metrics—precision, recall, specificity, and a few others—are commonly used to evaluate classification models. They all derive from the confusion matrix, which summarizes the results of a binary classification:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) “hit” | False Negative (FN) “miss” |
| Actual Negative | False Positive (FP) “false alarm” | True Negative (TN) “correct rejection” |
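To make the table concrete, here is a minimal Python sketch that tallies the four cells from small hypothetical label and prediction lists (the data below is purely illustrative):

```python
# Minimal sketch (hypothetical labels): tally the four confusion-matrix cells.
# 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 actual positives, 6 actual negatives
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # the model's binary predictions

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # hits
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # misses
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correct rejections

print(tp, fn, fp, tn)  # 3 1 2 4
```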
Precision is also known as Positive Predictive Value.
$$ \text{Precision} = \frac{TP}{TP + FP} $$
Precision answers the question: When the model predicts positive, how often is it correct?
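For instance, with the hypothetical counts from the sketch above (TP = 3, FP = 2), precision = 3 / (3 + 2) = 0.60: three out of every five positive predictions are correct.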
Recall is also known as Sensitivity or True Positive Rate.
$$ \text{Recall} = \frac{TP}{TP + FN} $$
Recall answers the question: Out of all actual positives, how many did the model correctly identify?
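With the same hypothetical counts (TP = 3, FN = 1), recall = 3 / (3 + 1) = 0.75: the model finds three of the four actual positives.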
Specificity is also known as True Negative Rate.
$$ \text{Specificity} = \frac{TN}{TN + FP} $$
Specificity answers the question: Out of all actual negatives, how many did the model correctly classify as negative?
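With the same hypothetical counts (TN = 4, FP = 2), specificity = 4 / (4 + 2) ≈ 0.67: four of the six actual negatives are correctly rejected.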
F1 Score is the harmonic mean of precision and recall, combining both into a single number.
$$ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
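Continuing the hypothetical example (precision = 0.60, recall = 0.75), F1 = 2 × (0.60 × 0.75) / (0.60 + 0.75) ≈ 0.67. Note that the harmonic mean sits below the arithmetic mean (0.675); it is pulled toward the weaker of the two values.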
Accuracy is the most straightforward metric, defined as the ratio of correctly predicted instances to the total number of instances.
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
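With the same hypothetical counts, accuracy = (3 + 4) / 10 = 0.70. Continuing the sketch above, all five metrics can be derived directly from the four counts (the guards against division by zero are an added convenience, not part of the formulas):

```python
# Continuing the sketch above: derive each metric from tp, fn, fp, tn.
precision   = tp / (tp + fp) if (tp + fp) else 0.0   # 3 / 5  = 0.60
recall      = tp / (tp + fn) if (tp + fn) else 0.0   # 3 / 4  = 0.75
specificity = tn / (tn + fp) if (tn + fp) else 0.0   # 4 / 6  ≈ 0.67
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)              # ≈ 0.67
accuracy = (tp + tn) / (tp + tn + fp + fn)           # 7 / 10 = 0.70
```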
Most binary classification models output a probability score for the positive class. To convert this score into a binary prediction, we choose a threshold (e.g., 0.5). A probability score above the threshold is classified as positive, while a score below the threshold is classified as negative.
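For instance, a minimal sketch of this thresholding step (the probability scores below are hypothetical):

```python
# Hypothetical probability scores for the positive class.
scores = [0.91, 0.42, 0.78, 0.05, 0.63]
threshold = 0.5

# Scores above the threshold map to the positive class (1), the rest to negative (0).
y_pred = [1 if s > threshold else 0 for s in scores]
print(y_pred)  # [1, 0, 1, 0, 1]
```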
The choice of threshold affects the model’s performance in terms of the metrics discussed so far. To evaluate the model’s performance across all possible thresholds, we use the ROC curve and AUC.
ROC (Receiver Operating Characteristic) curve is a graphical representation of a model’s performance across different thresholds. It plots True Positive Rate (Recall) vs. False Positive Rate (1 - Specificity) at each classification threshold.
AUC (Area Under the Curve) is the area under the ROC curve. It summarizes model performance across all thresholds in a single number, and a higher AUC indicates a better classifier. It is often used when you want to evaluate a model without being tied to a specific threshold.
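As a sketch of how this is typically computed, assuming scikit-learn is available (the labels and scores below are hypothetical):

```python
# Sketch using scikit-learn (assumed available); labels and scores are hypothetical.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.90, 0.80, 0.65, 0.40, 0.70, 0.30, 0.20, 0.15, 0.10, 0.05]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points tracing the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under that curve
print(f"AUC = {auc:.2f}")                          # 0.92 for these values
```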
The choice of metric depends on the specific problem and the consequences of false positives and false negatives. Here’s a summary:
| Scenario | Key Metric |
|---|---|
| False positives are costly (e.g., spam filters, fraud detection) | Precision |
| False negatives are costly (e.g., medical diagnosis, security threats) | Recall |
| Need a balance between precision & recall | F1 Score |
| Need to avoid unnecessary interventions (e.g., legal cases) | Specificity |
| Balanced dataset, overall correctness matters | Accuracy |
| Need threshold-independent evaluation | AUC-ROC |