Prereq · IC3

Evaluation & Data

Eval is the test suite for a probabilistic system — pick the wrong metric or leak your test data and a "97% accurate" model can be worthless; this is the discipline that separates a demo from production.

13 min read · 19 sections

Prerequisites: what ML is (supervised learning, labels), how models learn (train/loss, overfitting)

1. The one-sentence intuition

Evaluation is the test suite for a system that's allowed to be wrong: you can't assert output == expected, so you measure how often and in which direction it's wrong, and you guard the measurement so it doesn't lie to you. The two ways it lies are a metric that hides failure (accuracy on imbalanced data) and data leakage (the model peeked at the answers during training). If you're a SWE: picking a metric is choosing your assertion, and preventing leakage is keeping your test fixtures out of your training set. Leakage is the same crime as testing on your training data: it silently turns CI green while production burns.

2. Why a software engineer needs this

Everyone wants to train the model. Almost nobody wants to design the eval — and that's precisely the gap that separates a flashy notebook from something you'd put in front of users. This is the most underrated skill in ML, and it's the one a strong SWE can dominate, because it's really just testing under uncertainty.

It shows up everywhere downstream:

RAG (/rag) lives or dies on retrieval metrics — recall@k, precision@k — and on whether your eval questions leaked into your index.
Fine-tuning (/finetuning) is meaningless without a held-out set; "the loss went down" is not "the model got better."
LLM evals (/evals) — golden datasets, LLM-as-judge, regression suites — are this lesson's ideas applied to text you can't == compare.
Agents (/agents) need trajectory evals: did it call the right tools in the right order, not just produce plausible final text?

Interviews silently assume you can answer "why is accuracy misleading here?" and "how would you know this model degraded in production?" If you fumble those, the rest of your ML knowledge doesn't get a chance to show.

3. Build it up from scratch

3.1 The confusion matrix — the source of all classification metrics

Beginner explainer: new here?

The words first.

Confusion matrix — a 2×2 table that counts the four outcomes when a classifier makes predictions: TP, FP, FN, TN.
True Positive (TP) — the model said yes, and yes was correct (caught the thing you're hunting for).
False Positive (FP) — the model said yes, but no was correct (false alarm).
False Negative (FN) — the model said no, but yes was correct (you missed it).
True Negative (TN) — the model said no, and no was correct (correctly left alone).
Precision — of all the things the model flagged, how many were really true.
Recall — of all the things that are actually true, how many did the model catch.

Step by step.

Gather your ground-truth labels (what's actually spam or not).
Get the model's predictions (what the model says is spam or not).
Line up each prediction against the truth; put the result in one of four boxes.
Count how many fall in each box — that's your TP, FP, FN, TN.
Now every metric is just a ratio: divide the boxes the right way and you get precision, recall, accuracy, anything you need.

Remember this: the confusion matrix is the source of all classification metrics. Once you have those four counts, you're just doing arithmetic to answer "which errors matter most?"

Start with binary classification: each example is positive (the thing you're hunting for — spam, fraud, disease) or negative. The model predicts one of the two. Cross-tabulate prediction vs. truth and you get four cells, the confusion matrix:

	Actually Positive	Actually Negative
Predicted Positive	True Positive (TP)	False Positive (FP)
Predicted Negative	False Negative (FN)	True Negative (TN)

TP — caught it correctly.
FP — false alarm (flagged a good email as spam). Also called a Type I error.
FN — miss (let spam through). A Type II error.
TN — correctly left alone.

Every metric below is just a ratio of these four numbers. Internalize the table and the rest is arithmetic.

3.2 Accuracy, and why it lies

Accuracy = fraction you got right:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

✎ Accuracy — on real numbers

Accuracy is "fraction of all predictions I got right." The numerator is TP + TN (correct predictions), and the denominator is the total: TP + TN + FP + FN (all predictions).

Let's work it with real numbers. Say we have a spam filter on 100 emails:

TP = 30 (caught spam correctly)
FP = 10 (flagged good email as spam)
FN = 5 (missed spam)
TN = 55 (correctly left ham alone)

Accuracy = (30 + 55) / (30 + 55 + 10 + 5) = 85 / 100 = 0.85 — we got 85% right overall.

But here's the trap: if only 5% of mail is spam, then a model that says "never spam" gives (0 + 95) / 100 = 0.95 accuracy while catching zero real spam. This is why you always ask for the base rate — accuracy without context is a number that sounds good and hides disaster.

Every symbol is a count from the matrix. Clean, intuitive, and dangerously misleading when classes are imbalanced (one class is rare). If 1% of transactions are fraud, the model that predicts "never fraud" scores 99% accuracy while catching zero fraud. Accuracy rewards it for being right about the boring majority. The metric is technically true and operationally useless. Whenever you hear an accuracy number, your first question is "what's the base rate?": how rare is the positive class, and therefore what accuracy does the trivial always-predict-the-majority baseline already get?
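The base-rate check can be sketched in a few lines of plain Python. The 1% fraud rate comes from the paragraph above; the 10,000-transaction dataset and the helper name majority_baseline_accuracy are assumptions made up for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of the trivial model that always predicts the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical dataset: 10,000 transactions, 1% fraud (label 1).
labels = [1] * 100 + [0] * 9_900

baseline = majority_baseline_accuracy(labels)
fraud_caught = 0  # "never fraud" flags nothing, so it catches 0 of the 100 frauds

print(f"always-predict-majority accuracy = {baseline:.2f}")   # 0.99
print(f"fraud caught by that baseline    = {fraud_caught}")
```

Any accuracy you are quoted only means something as a gap over this number.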

3.3 Precision and recall — the two ways to be right

Split "being right" into two questions:

$$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN}$$

✎ Precision and Recall — on real numbers

These split "being right" into two different questions using the same numerator TP but opposite denominators.

Precision = TP / (TP + FP) — "of things I flagged, what fraction were actually real?" The denominator is everything I predicted positive.

Recall = TP / (TP + FN) — "of real things that exist, what fraction did I catch?" The denominator is everything that's actually positive.

Using our 100-email example with TP=30, FP=10, FN=5, TN=55:

Precision = 30 / (30 + 10) = 30 / 40 = 0.75 — 75% of flagged emails were real spam. One-quarter were false alarms.
Recall = 30 / (30 + 5) = 30 / 35 ≈ 0.86 — we caught 86% of real spam. We missed about 14%.

They trade off: lower your threshold to flag more, and recall goes up (catch more spam) but precision falls (more false alarms). Raise the threshold and the reverse. Which error costs more? That determines your choice. Cancer screening? Optimize recall (a miss costs a life; a false alarm costs a retest). Spam filter? Optimize precision (a false alarm loses a real email; a miss is just annoying).

Precision: of the things I flagged, how many were real? Denominator is everything I predicted positive. High precision = few false alarms.
Recall (a.k.a. sensitivity, true-positive rate): of the real things, how many did I catch? Denominator is everything that's actually positive. High recall = few misses.

They trade off. A model has a knob — the threshold on its predicted probability. Lower the threshold and it flags more aggressively: recall rises (catches more), precision falls (more false alarms). Raise it and the reverse. There's no free lunch; you pick the operating point based on which error costs more:

Use case	Cost of a miss (FN) vs. a false alarm (FP)	Optimize for
Cancer screening	A miss can kill; a false alarm = one more test	Recall
Spam filter	A false alarm hides a real email (bad); a miss = mild annoyance	Precision
Fraud blocking	Blocking a legit purchase angers customers	Precision
Search / RAG retrieval	Missing the relevant doc means a wrong answer	Recall (then rerank for precision)

This "which error is worse" question is a product decision, not a math one — which is exactly why eval is a design problem.
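To watch the threshold knob move both metrics, here is a minimal numpy sketch. The ten labels and scores are illustrative data, and precision_recall_at is a hypothetical helper, not a library function.

```python
import numpy as np

# Illustrative labels (1 = positive) and predicted probabilities; assumed data.
y_true  = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
y_proba = np.array([0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05, 0.6, 0.55])

def precision_recall_at(threshold):
    """Precision and recall when everything scoring >= threshold is flagged."""
    y_pred = y_proba >= threshold
    tp = int(np.sum(y_pred & (y_true == 1)))
    fp = int(np.sum(y_pred & (y_true == 0)))
    fn = int(np.sum(~y_pred & (y_true == 1)))
    # If nothing is flagged, precision is undefined rather than zero.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn)
    return precision, recall

for t in (0.25, 0.5, 0.75):
    p, r = precision_recall_at(t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this data the lenient threshold catches every positive at the cost of false alarms, and the strict one has no false alarms but misses half the positives. Which row you ship is the product decision.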

3.4 F1 — one number when you must have one

To compare models with a single score, combine precision and recall with their harmonic mean:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

✎ F1 — the harmonic mean on real numbers

F1 is the harmonic mean of precision and recall — use it when you must compare models with a single number and you don't want to pick sides.

With precision = 0.75 and recall ≈ 0.86: F1 = 2 × (0.75 × 0.86) / (0.75 + 0.86) = 2 × 0.645 / 1.61 ≈ 0.80

Why harmonic mean instead of the plain average? The harmonic mean punishes imbalance harder. If precision = 1.0 but recall = 0.0, the plain average is 0.5, but F1 = 0 — you don't get credit for perfection on one side while completely failing the other. F1 is the standard headline metric for imbalanced classification, and it's what you'll see in papers when nobody specifies which error matters more.

The harmonic mean (not the plain average) punishes imbalance: if precision is 1.0 but recall is 0.0, the simple average is 0.5 but F1 is 0 — you don't get credit for nailing one at the total expense of the other. F1 is the default headline metric for imbalanced classification. (The general $F_\beta$ lets you weight recall $\beta$ times more than precision when one matters more.)
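As a sketch of the general form, $F_\beta = (1+\beta^2)\,\frac{\text{precision}\cdot\text{recall}}{\beta^2\cdot\text{precision}+\text{recall}}$, which reduces to F1 at $\beta = 1$. The function name f_beta and the zero-division convention below are choices made for this example.

```python
def f_beta(precision, recall, beta=1.0):
    """General F-score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0  # convention chosen here to avoid 0/0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# The 100-email example: TP=30, FP=10, FN=5.
precision, recall = 30 / 40, 30 / 35

print(f"F1   = {f_beta(precision, recall):.2f}")            # 0.80
print(f"F2   = {f_beta(precision, recall, beta=2):.2f}")    # leans toward recall
print(f"F0.5 = {f_beta(precision, recall, beta=0.5):.2f}")  # leans toward precision
print(f"lopsided F1 = {f_beta(1.0, 0.0):.2f}")              # 0.00
```

Because recall is the higher of the two numbers in this example, F2 lands above F1 and F0.5 lands below it.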

3.5 ROC / AUC — grading the ranking, not the threshold

A classifier usually outputs a probability, and the threshold is a choice you make after training. ROC/AUC evaluates the model across all thresholds at once, so it's threshold-independent.

The ROC curve plots true-positive rate (recall) against false-positive rate $\big(\frac{FP}{FP+TN}\big)$ as you sweep the threshold from strict to lenient. AUC is the area under that curve: 1.0 is a perfect ranking, 0.5 is a random coin-flip, and anything below 0.5 is worse than random (the ranking is inverted).

The intuition that actually sticks: AUC is the probability that the model scores a random positive higher than a random negative. AUC = 0.9 means "pick one real fraud and one legit transaction at random; 90% of the time the model rates the fraud as more suspicious." It measures whether the model ranks correctly — useful, but note it can look rosy on heavy imbalance, where practitioners prefer the PR-AUC (area under the precision-recall curve) because it ignores the easy true negatives.
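The two views of AUC (area under the swept curve, and probability of ranking a positive above a negative) can be checked against each other. This is a plain-Python sketch on illustrative data with no tied scores; it is not how a library would implement it.

```python
# Illustrative labels and scores (assumed data, no tied scores).
y_true  = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05, 0.6, 0.55]

n_pos = sum(y_true)
n_neg = len(y_true) - n_pos

# Sweep the threshold from strict to lenient: admit one example at a time,
# highest score first, and record (FPR, TPR) after each step.
points = [(0.0, 0.0)]
tp = fp = 0
for score, label in sorted(zip(y_score, y_true), reverse=True):
    if label == 1:
        tp += 1
    else:
        fp += 1
    points.append((fp / n_neg, tp / n_pos))

# Area under the curve by the trapezoid rule.
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Same quantity from the ranking definition: fraction of (positive, negative)
# pairs where the positive gets the higher score.
pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]
pairwise = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

print(f"AUC (trapezoid) = {auc:.3f}")
print(f"AUC (pairwise)  = {pairwise:.3f}")
```

Both lines print the same value (20 of the 24 positive/negative pairs are ordered correctly), which is the point: the area and the ranking probability are one number.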

3.6 Regression metrics, briefly

When the target is a number (price, temperature) instead of a class, you measure distance from the truth. With predictions $\hat{y}_i$ and truths $y_i$ over $n$ examples:

MAE (mean absolute error) $= \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$: the average miss, in the original units, and less sensitive to outliers than MSE.
MSE (mean squared error) $= \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$: squares the errors, so it punishes large mistakes harder; RMSE is its square root, back in original units.
R²: the fraction of the target's variance the model explains; 1.0 is perfect, 0 means "no better than predicting the mean," and it goes negative when the model is worse than that.

Rule of thumb: MAE if all errors hurt equally, RMSE/MSE if big errors hurt disproportionately.
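A small sketch of the rule of thumb, in plain Python. The four house prices are made-up data with one deliberately large miss, so you can see RMSE react to it more than MAE does.

```python
import math

# Hypothetical house prices in $1000s: truths and predictions (assumed data).
y_true = [200.0, 250.0, 300.0, 350.0]
y_pred = [210.0, 240.0, 310.0, 290.0]   # the last prediction is a big miss

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n
mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)

mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)                 # what the model leaves unexplained
ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # what "predict the mean" leaves
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}  R2={r2:.2f}")
```

Three errors of 10 and one of 60 give an MAE of 22.5 but an RMSE above 31: the single large miss dominates the squared metric.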

3.7 The data split — train / val / test

A model's score on data it trained on is worthless: it can memorize. You measure generalization — performance on data it has never seen. So split your data three ways:

Train (~60–80%): the model fits its parameters on this.
Validation / dev (~10–20%): you tune hyperparameters and pick models on this. The model doesn't learn from it, but you do — so it slowly gets "used up."
Test (~10–20%): touched once, at the very end, to report the honest number. If you peek at the test set and adjust, it's no longer a test set.

SWE analogy: train = the code, validation = the local CI you iterate against, test = the production smoke test you run exactly once before launch. Run the launch test fifty times while tweaking and it stops measuring launch readiness.
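A three-way split can be sketched with the standard library alone. The 70/15/15 proportions, the function name train_val_test_split, and the integer stand-in rows are assumptions for illustration; a random shuffle like this is only appropriate when row order carries no time information.

```python
import random

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into three disjoint slices."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)      # fixed seed so the split is reproducible
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

rows = list(range(1000))                   # stand-in for 1,000 labeled examples
train, val, test = train_val_test_split(rows)

print(len(train), len(val), len(test))     # 700 150 150
```

Fixing the seed matters: if the split changes between runs, examples drift from test into train and the test set stops being held out.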

3.8 k-fold cross-validation — when data is scarce

A single val split is noisy when data is small. k-fold cross-validation: chop the (non-test) data into $k$ equal parts (folds, typically $k=5$ or $10$). Train on $k-1$ folds, validate on the held-out one, rotate so each fold is the validation set exactly once, then average the $k$ scores. You get a more stable estimate plus an error bar, at $k\times$ the compute. (For imbalanced classes, use stratified k-fold so each fold keeps the same class ratio.)
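The rotation can be sketched as an index generator. Here k_fold_indices and fit_and_score are hypothetical names; fit_and_score is a placeholder that returns a dummy number where a real training run would go.

```python
import random
import statistics

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each row is validated on exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k near-equal folds
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def fit_and_score(train_idx, val_idx):
    """Hypothetical placeholder: train on train_idx, score on val_idx."""
    return len(train_idx) / (len(train_idx) + len(val_idx))   # dummy score

scores = [fit_and_score(tr, va) for tr, va in k_fold_indices(n_rows=100, k=5)]
print(f"mean={statistics.mean(scores):.3f}  stdev={statistics.stdev(scores):.3f}")
```

The mean is your estimate and the standard deviation across folds is the error bar. This sketch does not stratify, so on imbalanced data you would group indices by class before dealing them into folds.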

3.9 Data leakage — the #1 silent killer

Leakage is when information from the test set, or from the future, sneaks into training. The model looks brilliant in evaluation and collapses in production — the worst failure mode because nothing errors; the numbers just lie. Concrete forms:

Preprocessing before splitting. You normalize/scale using statistics (mean, std) computed over the whole dataset, then split. The training rows now carry information about the test rows. Fix: fit all transforms on train only, then apply to val/test.
Target leakage. A feature is a stand-in for the answer. Predicting "will this account churn?" using account_closed_date — which only exists because they churned. In production that column is null. The model aces eval and is useless live.
Temporal leakage. Random-splitting time-series so the model trains on the future and predicts the past. Always split by time when time matters.
Duplicate / near-duplicate rows straddling the split — common in scraped or RAG data; the model "recognizes" the test example. In LLM-land this is train/test contamination: the benchmark questions were in the pretraining data, so the eval score is inflated memorization.

Catch it the way you catch a bug: a result that's too good is a symptom, not a celebration. Audit which features wouldn't exist at prediction time, and split before you touch the data.
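"Split before you touch the data" looks like this for the scaling case. A minimal sketch on one synthetic feature column; the data and the helper name scale are assumptions.

```python
import random
import statistics

rng = random.Random(0)
feature = [rng.gauss(50.0, 10.0) for _ in range(1000)]   # hypothetical raw column

# 1. Split FIRST (values are already in random order here).
train, test = feature[:800], feature[800:]

# 2. Fit the transform on the training rows only.
train_mean = statistics.fmean(train)
train_std = statistics.pstdev(train)

def scale(values):
    """Standardize using the TRAINING statistics, whichever split is passed in."""
    return [(v - train_mean) / train_std for v in values]

# 3. Apply the same fitted transform to both splits.
train_scaled = scale(train)
test_scaled = scale(test)

# The leaky version would compute the mean and std over `feature` (all 1,000
# rows), so test rows would shape every training value.
print(f"train mean after scaling = {statistics.fmean(train_scaled):.3f}")
print(f"test mean after scaling  = {statistics.fmean(test_scaled):.3f}")
```

The test mean after scaling is near zero but not exactly zero, which is what honest held-out data looks like: it was transformed by statistics it had no part in computing.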

3.10 Class imbalance — what to actually do

When positives are rare: (1) measure right — use precision/recall/F1/PR-AUC, never bare accuracy; (2) resample — oversample the minority (e.g. SMOTE, which synthesizes new minority points) or undersample the majority; (3) reweight the loss so minority errors count more (class_weight="balanced"); (4) move the threshold to hit your target recall. Do resampling inside cross-validation folds, never before splitting — otherwise you leak.
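Options (3) and (4) reduce to small formulas. The sketch below computes weights with the n_samples / (n_classes * class_count) heuristic that scikit-learn documents for class_weight="balanced", and picks the strictest threshold that still meets a recall target. Both helper names and the ten-row score data are hypothetical.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

def threshold_for_recall(y_true, y_score, target_recall):
    """Highest threshold (flag if score >= threshold) that meets the recall target."""
    pos_scores = sorted((s for s, y in zip(y_score, y_true) if y == 1), reverse=True)
    needed = max(1, math.ceil(target_recall * len(pos_scores)))
    return pos_scores[needed - 1]

labels = [1] * 50 + [0] * 950             # 5% positives
weights = balanced_class_weights(labels)
print(weights)                            # the rare class gets weight 10.0

# Illustrative scores (assumed data).
y_true  = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05, 0.6, 0.55]
print(threshold_for_recall(y_true, y_score, target_recall=0.75))   # 0.6
```

Choose the threshold on validation data, not on the test set, or the reported recall is tuned to the very examples you are grading on.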

3.11 Worked micro-example

✎ The spam filter micro-example — checking the arithmetic

The setup: 1,000 emails, 50 are actually spam (5% base rate). Confusion matrix: TP=40, FP=20, FN=10, TN=930.

Accuracy = (40 + 930) / 1,000 = 970 / 1,000 = 0.97 — 97% looks great. But the baseline (predict all ham) scores 950 / 1,000 = 0.95. So 97% is only 2 percentage points above "do nothing." Accuracy hides the real problem.

Precision = 40 / (40 + 20) = 40 / 60 ≈ 0.67 — of the 60 emails we flagged as spam, only 40 were real. One-third of our flags (20 emails) are false alarms — real messages lost to the spam folder. This is why you don't use accuracy on imbalanced data — it rewards getting the majority right.

Recall = 40 / (40 + 10) = 40 / 50 = 0.80 — we caught 80% of real spam. We missed 10 out of 50.

F1 = 2 × (0.67 × 0.80) / (0.67 + 0.80) = 2 × 0.536 / 1.47 ≈ 0.73

The lesson: the 97% headline hid the fact that 1 in 3 flagged emails is a real message lost to the spam folder. Always ask for precision and recall when classes are imbalanced; they're the honest metrics.

A spam filter on 1,000 emails; 50 are actually spam (5% base rate). The model produces this confusion matrix:

	Actually Spam	Actually Ham
Predicted Spam	TP = 40	FP = 20
Predicted Ham	FN = 10	TN = 930

Accuracy $= (40+930)/1000 = 0.97$ — looks great…
…but the do-nothing baseline "predict all ham" scores $950/1000 = 0.95$. So 97% is barely above trivial.
Precision $= 40/(40+20) = 0.67$ — a third of flagged emails were fine.
Recall $= 40/(40+10) = 0.80$ — caught 80% of spam.
F1 $= 2(0.67)(0.80)/(0.67+0.80) = 0.73$.

The 97% headline hid that 1-in-3 flagged emails is a real message lost to the spam folder — a precision problem you'd never see from accuracy alone.
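The arithmetic above can be re-derived from the four counts in a few lines. The counts come straight from the table; only the variable names are introduced here.

```python
# Confusion-matrix counts from the spam example.
TP, FP, FN, TN = 40, 20, 10, 930
total = TP + FP + FN + TN

accuracy = (TP + TN) / total
baseline = (FP + TN) / total        # "predict all ham": every actual-ham email is right
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f}  baseline={baseline:.2f}")          # 0.97 vs 0.95
print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")  # 0.67 0.80 0.73
```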


4. See it in code

Compute every metric from the four raw counts with numpy, so there's no magic.

import numpy as np
 
# Ground truth (1 = spam) and the model's predicted probability of spam.
y_true  = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
y_proba = np.array([0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05, 0.6, 0.55])
 
# A classifier is a probability + a THRESHOLD. Move it and metrics move.
threshold = 0.5
y_pred = (y_proba >= threshold).astype(int)
 
# The confusion matrix is just four boolean-AND counts.
TP = int(np.sum((y_pred == 1) & (y_true == 1)))
FP = int(np.sum((y_pred == 1) & (y_true == 0)))
FN = int(np.sum((y_pred == 0) & (y_true == 1)))
TN = int(np.sum((y_pred == 0) & (y_true == 0)))
 
accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)          # of flagged, how many real
recall    = TP / (TP + FN)          # of real, how many caught
f1        = 2 * precision * recall / (precision + recall)
 
print(f"TP={TP} FP={FP} FN={FN} TN={TN}")
print(f"acc={accuracy:.2f}  precision={precision:.2f}  "
      f"recall={recall:.2f}  f1={f1:.2f}")
 
# AUC by its definition: P(random positive ranked above random negative).
pos = y_proba[y_true == 1]          # scores of the actual positives
neg = y_proba[y_true == 0]          # scores of the actual negatives
wins = np.mean([p > n for p in pos for n in neg])   # all positive/negative pairs
print(f"AUC (rank interpretation) = {wins:.2f}")

Line by line: y_proba >= threshold turns probabilities into a decision — change threshold and watch precision and recall trade off. The four counts are the confusion matrix; everything else is arithmetic on them. The AUC block computes the metric straight from its definition — fraction of positive/negative pairs the model ranks correctly — which is the intuition worth keeping. (In real code you'd call sklearn.metrics: precision_recall_fscore_support, roc_auc_score, confusion_matrix — but now you know what they return.)

✎ What this code does, section by section

Setup (the y_true and y_proba arrays): We define ground truth y_true (1 for spam, 0 for not-spam) and y_proba (the model's confidence score for each email, between 0 and 1). These are paired: row 0 means "email 0 was actually spam (y_true=1) and the model scored it 0.9 (pretty confident it's spam)."

Threshold and decision (threshold and y_pred): A classifier needs a threshold, and we pick 0.5 here. The line y_pred = (y_proba >= threshold).astype(int) turns each probability into a yes/no: if the score is 0.5 or higher, predict spam (1), else not-spam (0). The threshold is a knob: lower it and the model flags more as spam (higher recall, lower precision); raise it and it gets pickier (lower recall, higher precision).

Confusion matrix (the TP, FP, FN, TN lines): Four boolean-AND operations count each outcome. (y_pred == 1) & (y_true == 1) finds rows where we predicted spam and it actually was spam; that's TP. Similarly for FP, FN, TN. These four numbers are the entire source of truth; everything else flows from them. On this data they come out as TP=3, FP=2, FN=1, TN=4.

Metrics (accuracy through f1): Accuracy, precision, recall, and F1 are all arithmetic: divide the confusion matrix counts the right way. Precision checks false alarms; recall checks misses. Here that gives accuracy 0.70, precision 0.60, recall 0.75, and F1 0.67.

AUC (the pos, neg, and wins lines): The AUC code skips the threshold entirely. It asks: "how often does the model score a random real positive higher than a random real negative?" Loop over all positive/negative pairs, count wins, and divide by the total (20 of 24 pairs here, so 0.83). That fraction is AUC, and it's the intuition that sticks: AUC measures ranking quality, not any single decision threshold. This version counts a tied pair as a loss; the usual convention gives a tie half credit, which makes no difference here because no positive and negative share a score.

5. Mental models & SWE analogies

Metric ≈ the assertion in your test. A weak metric is assert response is not None — green forever, catches nothing. Precision/recall is assert response == expected_for_the_cases_that_matter.
Test set ≈ a held-out integration fixture. Train on it and your "passing tests" measure memorization, exactly like asserting against the same data you hard-coded.
Data leakage ≈ a global mutable variable bleeding test state into prod code. Invisible in the test run, catastrophic in production — and only found by tracing data provenance.
Threshold ≈ a feature flag / log-level dial. One knob slides the whole behavior from "alert on everything" (high recall, noisy) to "alert on nothing but certainties" (high precision, quiet). You tune it per environment.
Data drift ≈ dependency rot / bit rot. Your code is byte-identical, yet behavior degrades because the world (input distribution) changed under it — like an API you call silently changing its response shape.
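The drift analogy suggests a monitoring check. The sketch below is one crude option, assumed for illustration rather than taken from any standard: measure how far a feature's live mean has moved, in units of its training standard deviation. The data, the function name mean_shift_in_stds, and the alert level are all hypothetical.

```python
import statistics

def mean_shift_in_stds(train_values, live_values):
    """How far the live mean has moved, in units of the training std dev."""
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values)
    return abs(statistics.fmean(live_values) - mu) / sigma

train_amounts = [20, 25, 30, 35, 40, 45, 50, 55, 60]   # assumed training data
live_same     = [22, 31, 38, 44, 52, 58]                # looks like training
live_shifted  = [80, 95, 110, 120, 130, 150]            # the world changed

ALERT_AT = 1.0   # hypothetical alert level; tune per feature
for name, live in [("same", live_same), ("shifted", live_shifted)]:
    shift = mean_shift_in_stds(train_amounts, live)
    print(f"{name}: shift={shift:.2f} std  alert={shift > ALERT_AT}")
```

A mean-shift check misses changes in shape or spread, so treat it as a smoke alarm next to live precision and recall on a labeled sample, not a replacement for them.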

6. Common confusions

"97% accuracy = good model." Only relative to the base rate. On 5%-positive data, 95% is the do-nothing baseline.
"Precision and recall are basically the same." Opposite denominators. Precision polices false alarms; recall polices misses. You usually trade one for the other.
"AUC measures accuracy." It measures ranking quality across all thresholds, independent of any single threshold or the actual decisions you ship.
"More data fixes imbalance." Not if the ratio stays the same. You need reweighting, resampling, or a better metric — not just volume.
"I'll normalize, then split." That's leakage. Split first; fit every transform on train only.
"The validation set is my safe final number." No — you tuned against it, so it's optimistic. The untouched test set is the honest one.
"It passed eval, so it'll work in prod." Only if eval data matches production data and stays matching it. Distributions drift; eval is a snapshot, not a guarantee.

7. Check yourself

[Prereq] Your spam classifier is 97% accurate. Is that good? Unknown without the base rate. If 5% of mail is spam, the "all ham" baseline already hits 95%, so 97% is barely better. Ask for precision and recall (or the confusion matrix) to see the real story — likely it's missing spam, raising false alarms, or both.

[IC3] Precision vs recall — define both and give a scenario for each. Precision = TP/(TP+FP): of what I flagged, how much was real (penalizes false alarms). Recall = TP/(TP+FN): of what's real, how much I caught (penalizes misses). Optimize recall for cancer screening (a miss is deadly, a false alarm is just another test). Optimize precision for a spam filter (a false alarm buries a real email; a missed spam is mild). They trade off via the decision threshold.

[IC3] What is data leakage? Concrete example + how you'd catch it. Train-time access to information unavailable at prediction time, inflating eval and crashing in prod. Example: predicting churn using account_closed_date, which only exists after churn. Catch it by auditing whether each feature exists at prediction time, splitting before any preprocessing, and treating a suspiciously high score as a red flag to investigate, not celebrate.

[IC3] How is shipping an ML model different from shipping normal software? Code is deterministic; a model is a probabilistic function of a data distribution. Two extra failure modes: data drift (inputs shift, so a byte-identical model silently degrades) and silent degradation (no exception, just rising error). So you don't just deploy — you monitor distributions and live metrics, keep a held-out golden set, and plan to retrain. The workflow is a loop: data → train → evaluate → deploy → monitor → back to data.

You're ready to move on when you can read off precision, recall, and F1 from a confusion matrix, explain why accuracy lies on imbalanced data, and spot data leakage in a feature list — without reaching for a reference.

8. Go deeper

Stanford CS229 — generalization, evaluation, and the bias-variance view of held-out error: cs229.stanford.edu
scikit-learn — Model evaluation — the canonical metric reference you'll actually use: scikit-learn.org/stable/modules/model_evaluation.html
scikit-learn — Cross-validation — k-fold, stratification, and the leakage traps: scikit-learn.org/stable/modules/cross_validation.html
Google ML Crash Course — ROC & AUC — the cleanest visual intuition: developers.google.com/.../roc-and-auc
Dive into Deep Learning — Generalization — train/val/test and overfitting from first principles: d2l.ai

Next: you now have the full ML-foundations toolkit — see how these ideas become LLM evals (golden sets, LLM-as-judge, regression suites) in /evals, or apply retrieval metrics directly in /rag. Back to the map at /ml-foundations.
