Eval is the test suite for a probabilistic system — pick the wrong metric or leak your test data and a "97% accurate" model can be worthless; this is the discipline that separates a demo from production.
Evaluation is the test suite for a system that's allowed to be wrong — you can't assert output == expected, so you measure how often and in which direction it's wrong, and you guard the measurement so it doesn't lie to you. The two ways it lies are a metric that hides failure (accuracy on imbalanced data) and data leakage (the model peeked at the answers during training). If you're a SWE: picking a metric is choosing your assertion, and preventing leakage is keeping your test fixtures out of your training set — exactly the same crime as testing on your training data, which silently turns CI green while production burns.
Everyone wants to train the model. Almost nobody wants to design the eval — and that's precisely the gap that separates a flashy notebook from something you'd put in front of users. This is the most underrated skill in ML, and it's the one a strong SWE can dominate, because it's really just testing under uncertainty.
It shows up everywhere downstream:
Interviews silently assume you can answer "why is accuracy misleading here?" and "how would you know this model degraded in production?" If you fumble those, the rest of your ML knowledge doesn't get a chance to show.
The words first.
Step by step.
Remember this: the confusion matrix is the source of all classification metrics. Once you have those four counts, you're just doing arithmetic to answer "which errors matter most?"
Start with binary classification: each example is positive (the thing you're hunting for — spam, fraud, disease) or negative. The model predicts one of the two. Cross-tabulate prediction vs. truth and you get four cells, the confusion matrix:
| | Actually Positive | Actually Negative |
|---|---|---|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
Every metric below is just a ratio of these four numbers. Internalize the table and the rest is arithmetic.
Accuracy = fraction you got right:
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Accuracy is "fraction of all predictions I got right." The numerator is TP + TN (correct predictions), and the denominator is the total: TP + TN + FP + FN (all predictions).
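Let's work it with real numbers. Say we have a spam filter on 100 emails, with TP = 30 (spam caught), FP = 10 (real mail flagged), FN = 5 (spam missed), and TN = 55 (real mail passed):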
Accuracy = (30 + 55) / (30 + 55 + 10 + 5) = 85 / 100 = 0.85 — we got 85% right overall.
But here's the trap: if only 5% of mail is spam, then a model that says "never spam" gives (0 + 95) / 100 = 0.95 accuracy while catching zero real spam. This is why you always ask for the base rate — accuracy without context is a number that sounds good and hides disaster.
Every symbol is a count from the matrix. Clean, intuitive — and dangerously misleading when classes are imbalanced (one class is rare). If 1% of transactions are fraud, the model that predicts "never fraud" scores 99% accuracy while catching zero fraud. Accuracy rewards it for being right about the boring majority. The metric is technically true and operationally useless. Whenever you hear an accuracy number, your first question is "what's the base rate?" — the accuracy of the trivial always-predict-the-majority baseline.
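The base-rate question above can be answered mechanically. Here is a minimal sketch in plain Python; the function name `majority_baseline_accuracy`, the toy labels, and the 0.99 headline figure are illustrative assumptions, not part of any library or dataset.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of the trivial model that always predicts the most common class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Hypothetical data: 1% fraud (label 1), 99% legitimate (label 0).
labels = [1] * 10 + [0] * 990

baseline = majority_baseline_accuracy(labels)
model_accuracy = 0.99  # assumed headline number, for illustration only

print(f"baseline accuracy  = {baseline:.2f}")
print(f"lift over baseline = {model_accuracy - baseline:+.2f}")
```

If the lift is close to zero, the headline accuracy says almost nothing about whether the model finds the rare class.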
Split "being right" into two questions:
$$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN}$$
These split "being right" into two different questions using the same numerator TP but opposite denominators.
Precision = TP / (TP + FP) — "of things I flagged, what fraction were actually real?" The denominator is everything I predicted positive.
Recall = TP / (TP + FN) — "of real things that exist, what fraction did I catch?" The denominator is everything that's actually positive.
Using our 100-email example with TP=30, FP=10, FN=5, TN=55:
Precision = 30 / (30 + 10) = 30 / 40 = 0.75, so 75% of flagged emails were real spam. One-quarter were false alarms.

Recall = 30 / (30 + 5) = 30 / 35 ≈ 0.86, so we caught 86% of real spam. We missed about 14%.

They trade off: lower your threshold to flag more, and recall goes up (catch more spam) but precision falls (more false alarms). Raise the threshold and the reverse. Which error costs more? That determines your choice. Cancer screening? Optimize recall (a miss costs a life; a false alarm costs a retest). Spam filter? Optimize precision (a false alarm loses a real email; a miss is just annoying).
They trade off. A model has a knob — the threshold on its predicted probability. Lower the threshold and it flags more aggressively: recall rises (catches more), precision falls (more false alarms). Raise it and the reverse. There's no free lunch; you pick the operating point based on which error costs more:
| You care about | Because a miss (FN) / false alarm (FP) is… | Optimize for |
|---|---|---|
| Cancer screening | A miss can kill; a false alarm = one more test | Recall |
| Spam filter | A false alarm hides a real email (bad); a miss = mild annoyance | Precision |
| Fraud blocking | Blocking a legit purchase angers customers | Precision |
| Search / RAG retrieval | Missing the relevant doc means a wrong answer | Recall (then rerank for precision) |
This "which error is worse" question is a product decision, not a math one — which is exactly why eval is a design problem.
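The threshold knob is easy to see in code. The sketch below uses plain Python and hypothetical toy labels and scores; `precision_recall_at` is an illustrative helper, not a library function.

```python
def precision_recall_at(threshold, y_true, y_score):
    """Precision and recall when every score >= threshold is flagged positive."""
    tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, y_score) if s < threshold and t == 1)
    # Precision is undefined when nothing is flagged; report NaN in that case.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

# Hypothetical toy data: 1 = positive, scores are predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05, 0.6, 0.55]

for threshold in (0.25, 0.5, 0.75):
    p, r = precision_recall_at(threshold, y_true, y_score)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold moves precision up and recall down, which is the operating-point choice the table describes.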
To compare models with a single score, combine precision and recall with their harmonic mean:
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
F1 is the harmonic mean of precision and recall — use it when you must compare models with a single number and you don't want to pick sides.
With precision = 0.75 and recall ≈ 0.86:
F1 = 2 × (0.75 × 0.86) / (0.75 + 0.86) = 2 × 0.645 / 1.61 ≈ 0.80
Why harmonic mean instead of the plain average? The harmonic mean punishes imbalance harder. If precision = 1.0 but recall = 0.0, the plain average is 0.5, but F1 = 0 — you don't get credit for perfection on one side while completely failing the other. F1 is the standard headline metric for imbalanced classification, and it's what you'll see in papers when nobody specifies which error matters more.
The harmonic mean (not the plain average) punishes imbalance: if precision is 1.0 but recall is 0.0, the simple average is 0.5 but F1 is 0 — you don't get credit for nailing one at the total expense of the other. F1 is the default headline metric for imbalanced classification. (The general $F_\beta$ lets you weight recall $\beta$ times more than precision when one matters more.)
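As a rough illustration of the $F_\beta$ family, the sketch below uses the standard definition $F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$. The helper name `f_beta` is an assumption for this example; the precision and recall values are the ones from the 100-email example.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 leans toward recall, beta < 1 toward precision."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid 0/0; both components failed
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.75, 30 / 35  # from the 100-email example

print(f"F1   = {f_beta(precision, recall):.2f}")
print(f"F2   = {f_beta(precision, recall, beta=2):.2f}")
print(f"F0.5 = {f_beta(precision, recall, beta=0.5):.2f}")

# Degenerate case: perfect precision, zero recall.
print(f"plain average = {(1.0 + 0.0) / 2:.2f}, F1 = {f_beta(1.0, 0.0):.2f}")
```

Because recall is the larger of the two here, weighting recall more (beta = 2) raises the score and weighting precision more (beta = 0.5) lowers it.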
A classifier usually outputs a probability, and the threshold is a choice you make after training. ROC/AUC evaluates the model across all thresholds at once, so it's threshold-independent.
The ROC curve plots true-positive rate (recall) against false-positive rate $\big(\frac{FP}{FP+TN}\big)$ as you sweep the threshold from strict to lenient. AUC is the area under that curve: 0.5 is a random coin-flip and 1.0 is perfect (a score below 0.5 means the ranking is worse than random, usually a sign that labels or scores are inverted).
The intuition that actually sticks: AUC is the probability that the model scores a random positive higher than a random negative. AUC = 0.9 means "pick one real fraud and one legit transaction at random; 90% of the time the model rates the fraud as more suspicious." It measures whether the model ranks correctly — useful, but note it can look rosy on heavy imbalance, where practitioners prefer the PR-AUC (area under the precision-recall curve) because it ignores the easy true negatives.
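The two views of AUC (area under the ROC curve, and probability of ranking a positive above a negative) can be checked against each other. This is a minimal sketch in plain Python on hypothetical toy data; `roc_points`, `trapezoid_auc`, and `pairwise_auc` are illustrative names, and counting ties as half a win is a common convention assumed here.

```python
def roc_points(y_true, y_score):
    """(FPR, TPR) pairs as the threshold sweeps from strict to lenient."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # strictest setting: nothing is flagged
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under a piecewise-linear curve given (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def pairwise_auc(y_true, y_score):
    """P(random positive outscores random negative); ties count as half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05, 0.6, 0.55]

area = trapezoid_auc(roc_points(y_true, y_score))
rank_prob = pairwise_auc(y_true, y_score)
print(f"area under ROC = {area:.3f}, pairwise ranking = {rank_prob:.3f}")
```

The two numbers agree, which is why the ranking interpretation is a safe way to reason about AUC.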
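When the target is a number (price, temperature) instead of a class, you measure distance from the truth. With predictions $\hat{y}_i$ and truths $y_i$ over $n$ examples:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad \text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad \text{RMSE} = \sqrt{\text{MSE}}$$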
Rule of thumb: MAE if all errors hurt equally, RMSE/MSE if big errors hurt disproportionately.
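To make the rule of thumb concrete, the sketch below computes MAE and RMSE (using their standard definitions) for two hypothetical sets of predictions with the same total absolute error. The prices and function names are made up for illustration.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: squaring makes large misses dominate."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: MSE brought back to the target's units."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical house prices in thousands.
y_true = [200, 250, 300, 350]
small_errors = [210, 240, 310, 340]   # off by 10 everywhere
one_big_miss = [200, 250, 300, 390]   # exact except one miss of 40

for name, preds in (("small errors", small_errors), ("one big miss", one_big_miss)):
    print(f"{name}: MAE={mae(y_true, preds):.1f} RMSE={rmse(y_true, preds):.1f}")
```

Both prediction sets have the same MAE, but RMSE is twice as large for the set with the single big miss, which is the behavior you want when large errors are disproportionately costly.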
A model's score on data it trained on is worthless: it can memorize. You measure generalization — performance on data it has never seen. So split your data three ways:
SWE analogy: train = the code, validation = the local CI you iterate against, test = the production smoke test you run exactly once before launch. Run the launch test fifty times while tweaking and it stops measuring launch readiness.
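A three-way split can be sketched in a few lines of plain Python. The 70/15/15 proportions, the fixed seed, and the helper name `train_val_test_split` are assumptions for illustration; the right proportions depend on how much data you have.

```python
import random

def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into three disjoint parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical dataset of 100 example ids.
train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

A random shuffle is only appropriate when examples are independent; for time-ordered data, a split by date is the usual way to avoid training on the future.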
A single val split is noisy when data is small. k-fold cross-validation: chop the (non-test) data into $k$ equal parts (folds, typically $k=5$ or $10$). Train on $k-1$ folds, validate on the held-out one, rotate so each fold is the validation set exactly once, then average the $k$ scores. You get a more stable estimate plus an error bar, at $k\times$ the compute. (For imbalanced classes, use stratified k-fold so each fold keeps the same class ratio.)
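The rotation described above can be written out directly. This is a minimal, non-stratified sketch in plain Python; `k_fold_indices`, `cross_validate`, and the majority-class "model" are hypothetical stand-ins for a real training and scoring routine.

```python
import random
import statistics

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, val_idx); each index lands in validation exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def cross_validate(n, k, score_fn):
    """Average score_fn over the k folds; also return the spread as an error bar."""
    scores = [score_fn(tr, va) for tr, va in k_fold_indices(n, k)]
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical labels and a stand-in "model" that predicts the training majority.
labels = [0] * 70 + [1] * 30

def majority_score(train_idx, val_idx):
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in val_idx) / len(val_idx)

mean_score, spread = cross_validate(len(labels), 5, majority_score)
print(f"accuracy = {mean_score:.2f} +/- {spread:.2f}")
```

In practice `score_fn` would fit a fresh model on `train_idx` and score it on `val_idx`; any preprocessing or resampling belongs inside that function so it only ever sees the training fold.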
Leakage is when information from the test set, or from the future, sneaks into training. The model looks brilliant in evaluation and collapses in production — the worst failure mode because nothing errors; the numbers just lie. Concrete forms:
Example: a churn model that sees account_closed_date, which only exists because they churned. In production that column is null. The model aces eval and is useless live.

Catch it the way you catch a bug: a result that's too good is a symptom, not a celebration. Audit which features wouldn't exist at prediction time, and split before you touch the data.
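Two of these checks are simple enough to sketch in plain Python: a feature audit, and fitting preprocessing statistics on the training split only. The feature `tenure_months`, the audit table, and the toy values are hypothetical; only `account_closed_date` comes from the example above.

```python
import statistics

# Hypothetical audit table: does each feature exist at prediction time?
feature_available_at_prediction = {
    "tenure_months": True,
    "account_closed_date": False,  # only populated after the customer churned
}
leaky_features = [name for name, ok in feature_available_at_prediction.items() if not ok]
print("drop before training:", leaky_features)

def fit_scaler(values):
    """Learn mean and standard deviation from the given values."""
    return statistics.mean(values), statistics.pstdev(values)

def apply_scaler(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical feature column, already split.
train_values = [1.0, 2.0, 3.0, 4.0]
test_values = [10.0, 12.0]

# Leaky: statistics computed on train + test let test data shape the training inputs.
leaky_mean, leaky_std = fit_scaler(train_values + test_values)

# Clean: fit on train only, then reuse the same statistics for test.
mean, std = fit_scaler(train_values)
train_scaled = apply_scaler(train_values, mean, std)
test_scaled = apply_scaler(test_values, mean, std)
print(f"train-only mean={mean:.2f}, leaky mean={leaky_mean:.2f}")
```

The same pattern applies to any fitted step (imputation, vocabulary building, feature selection): fit on train, apply to validation and test.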
When positives are rare: (1) measure right — use precision/recall/F1/PR-AUC, never bare accuracy; (2) resample — oversample the minority (e.g. SMOTE, which synthesizes new minority points) or undersample the majority; (3) reweight the loss so minority errors count more (class_weight="balanced"); (4) move the threshold to hit your target recall. Do resampling inside cross-validation folds, never before splitting — otherwise you leak.
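Options (3) and (4) can be sketched without any library. The weight formula below, n_samples / (n_classes × class count), is the heuristic scikit-learn documents for `class_weight="balanced"`; the helper names and toy data are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

def threshold_for_recall(y_true, y_score, target_recall):
    """Highest threshold whose recall meets the target (score >= threshold is flagged)."""
    n_pos = sum(y_true)
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
        if tp / n_pos >= target_recall:
            return threshold
    return None  # only reached if target_recall > 1

# Hypothetical 5% positive rate.
weights = balanced_class_weights([0] * 95 + [1] * 5)
print(weights)

# Hypothetical validation labels and scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05, 0.6, 0.55]
print(threshold_for_recall(y_true, y_score, target_recall=0.75))
```

Pick the threshold on validation data, not on the test set, for the same reason you tune anything else there.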
The setup: 1,000 emails, 50 are actually spam (5% base rate). Confusion matrix: TP=40, FP=20, FN=10, TN=930.
Accuracy = (40 + 930) / 1,000 = 970 / 1,000 = 0.97 — 97% looks great. But the baseline (predict all ham) scores 950 / 1,000 = 0.95. So 97% is only 2 percentage points above "do nothing." Accuracy hides the real problem.
Precision = 40 / (40 + 20) = 40 / 60 ≈ 0.67 — of the 60 emails we flagged as spam, only 40 were real. One-third of our flags (20 emails) are false alarms — real messages lost to the spam folder. This is why you don't use accuracy on imbalanced data — it rewards getting the majority right.
Recall = 40 / (40 + 10) = 40 / 50 = 0.80 — we caught 80% of real spam. We missed 10 out of 50.
F1 = 2 × (0.67 × 0.80) / (0.67 + 0.80) = 2 × 0.536 / 1.47 ≈ 0.73
The lesson: the 97% headline hid the fact that 1 in 3 flagged emails is a real message sent to the spam folder. Always ask for precision and recall when classes are imbalanced; they're the honest metrics.
A spam filter on 1,000 emails; 50 are actually spam (5% base rate). The model produces this confusion matrix:
| | Actually Spam | Actually Ham |
|---|---|---|
| Predicted Spam | TP = 40 | FP = 20 |
| Predicted Ham | FN = 10 | TN = 930 |
The 97% headline hid that 1-in-3 flagged emails is a real message lost to the spam folder — a precision problem you'd never see from accuracy alone.
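The arithmetic for this confusion matrix can be double-checked with a small helper. This is a sketch in plain Python; `metrics_from_counts` is a hypothetical name, and the counts are the ones from the table above.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Headline metrics from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "baseline": max(tp + fn, fp + tn) / total,  # always predict the majority class
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_alarm_share": fp / (tp + fp),  # fraction of flags that were real mail
    }

report = metrics_from_counts(tp=40, fp=20, fn=10, tn=930)
for name, value in report.items():
    print(f"{name:>18} = {value:.2f}")
```

Printing accuracy next to the baseline and the false-alarm share makes the gap between the headline and the actual behavior hard to miss.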
@@demo:loss-curve@@
Compute every metric from the four raw counts with numpy, so there's no magic.
```python
import numpy as np

# Ground truth (1 = spam) and the model's predicted probability of spam.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
y_proba = np.array([0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05, 0.6, 0.55])

# A classifier is a probability + a THRESHOLD. Move it and metrics move.
threshold = 0.5
y_pred = (y_proba >= threshold).astype(int)

# The confusion matrix is just four boolean-AND counts.
TP = int(np.sum((y_pred == 1) & (y_true == 1)))
FP = int(np.sum((y_pred == 1) & (y_true == 0)))
FN = int(np.sum((y_pred == 0) & (y_true == 1)))
TN = int(np.sum((y_pred == 0) & (y_true == 0)))

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)  # of flagged, how many real
recall = TP / (TP + FN)  # of real, how many caught
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={TP} FP={FP} FN={FN} TN={TN}")
print(f"acc={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")

# AUC by its definition: P(random positive ranked above random negative).
pos = y_proba[y_true == 1]  # scores of the actual positives
neg = y_proba[y_true == 0]  # scores of the actual negatives
wins = np.mean([p > n for p in pos for n in neg])  # all positive/negative pairs
print(f"AUC (rank interpretation) = {wins:.2f}")
```

Line by line: y_proba >= threshold turns probabilities into a decision; change threshold and watch precision and recall trade off. The four counts are the confusion matrix; everything else is arithmetic on them. The AUC block computes the metric straight from its definition (the fraction of positive/negative pairs the model ranks correctly), which is the intuition worth keeping. (In real code you'd call sklearn.metrics: precision_recall_fscore_support, roc_auc_score, confusion_matrix; but now you know what they return.)
:::whiteboard What this code does, section by section
Setup: We define ground truth y_true (1 for spam, 0 for not-spam) and y_proba (the model's confidence scores for each email, ranging 0–1). These are paired: row 0 means "email 0 was actually spam (y_true=1) and the model scored it 0.9 (pretty confident it's spam)."
Threshold and decision: A classifier needs a threshold; we pick 0.5 here. The line y_pred = (y_proba >= threshold).astype(int) turns each probability into a yes/no: if the score is 0.5 or higher, predict spam (1), else not-spam (0). The threshold is a knob: lower it and the model flags more as spam (higher recall, lower precision); raise it and it gets pickier (lower recall, higher precision).
Confusion matrix: Four boolean-AND operations count each outcome. (y_pred == 1) & (y_true == 1) finds rows where we predicted spam and it actually was spam; that's TP. Similarly for FP, FN, TN. These four numbers are the entire source of truth; everything else flows from them.
Metrics: Accuracy, precision, recall, and F1 are all arithmetic: divide the confusion matrix counts the right way. Precision checks false alarms; recall checks misses.
AUC: The AUC code skips the threshold entirely. It asks: "how often does the model score a random real positive higher than a random real negative?" Loop over all positive/negative pairs, count wins, and divide by the total. That fraction is AUC, and it measures ranking quality, not any single decision threshold.
Accuracy on imbalanced data is assert response is not None: green forever, catches nothing. Precision/recall is assert response == expected_for_the_cases_that_matter.

Leakage example: a churn model trained on account_closed_date, which only exists after churn. Catch it by auditing whether each feature exists at prediction time, splitting before any preprocessing, and treating a suspiciously high score as a red flag to investigate, not celebrate.

You're ready to move on when you can read off precision, recall, and F1 from a confusion matrix, explain why accuracy lies on imbalanced data, and spot data leakage in a feature list, without reaching for a reference.
Next: you now have the full ML-foundations toolkit — see how these ideas become LLM evals (golden sets, LLM-as-judge, regression suites) in /evals, or apply retrieval metrics directly in /rag. Back to the map at /ml-foundations.