ML is writing a program by showing it examples instead of typing the rules — and the whole game is whether that learned program works on data it has never seen.
Machine learning is writing a program by showing it examples instead of typing the rules by hand. You don't code `if email contains "viagra" then spam`; you collect 100,000 emails labeled spam/not-spam and let an algorithm fit a function that maps email → label. The "code" is a big pile of numbers (the model's parameters) that got tuned to match your examples. As a SWE: it's like having a fuzzy spec and an enormous test suite, and a compiler that searches for any implementation that passes the tests, except the artifact it produces is a numeric function, not source you'd read.
The catch — and the entire field, really — is this: passing the tests you showed it is easy. The job is making it work on inputs it has never seen.
Every pillar downstream of here is ML wearing a costume:
Interviews silently assume you can say, in plain English, what supervised vs. self-supervised learning is, why we hold out a test set, and what overfitting looks like. Get these wrong and you sound like someone who has only used the API, never understood the machine. This lesson is the map; the rest of /ml-foundations fills in the territory.
The words first.
[sqft, bedrooms, price], an image is pixel values).
Step by step.
f(x) = w·x + b).
Remember this: Machine learning is searching for parameter values that make a function fit your examples well enough that it also works on unseen data.
| | Traditional programming | Machine learning |
|---|---|---|
| You write | the rules (logic) | the examples (data) + a model shape |
| The computer produces | the output | the rules (parameters), then outputs |
| Good when | rules are knowable & stable | rules are fuzzy, high-dimensional, or unknown |
| Debugging | read the code | inspect the data & errors |
Nobody can write the if/else for "is this photo a cat" or "is this the right next word." There are too many cases and no crisp rule. So instead we pick a flexible function with knobs and turn the knobs until it matches examples. That's it.
- Features (x): the input, as numbers. A house is [sqft, bedrooms, zip]; an image is a grid of pixel values; text becomes token IDs. Turning raw stuff into useful numbers is feature engineering (deep learning automates much of it).
- Label (y): the answer you want: a price, a category, the next word.
- Model (f): the function shape, e.g. a line, a tree, a neural network.
- Parameters (θ, "theta"): the knobs inside f that learning adjusts. A linear model f(x) = w·x + b has parameters w (weights) and b (bias); GPT-class models have hundreds of billions.

The model with its knobs set is written f_θ, read "f parameterized by θ." Learning = finding good θ.
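To make the vocabulary concrete, here is a minimal sketch of a linear model f_θ(x) = w·x + b in NumPy. The feature values and the parameter values are made up for illustration (nothing here is learned), and the function name f is simply the symbol used in the text.

```python
import numpy as np

# Hypothetical feature vector for one house: [sqft, bedrooms].
x = np.array([2000.0, 3.0])

# Parameters θ = (w, b). These values are invented for illustration, not learned.
w = np.array([0.1, 5.0])  # one weight per feature
b = 50.0                  # bias

def f(x, w, b):
    """Linear model f_θ(x) = w·x + b."""
    return np.dot(w, x) + b

y_hat = f(x, w, b)
print(y_hat)  # 0.1*2000 + 5*3 + 50 = 265.0
```

Changing w or b changes the prediction without touching the code of f, which is the sense in which the parameters, not the source, are "the program."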
The flavors differ only in what kind of examples (supervision signal) you get.
| Type | What you give it | Example | "Label" comes from |
|---|---|---|---|
| Supervised | inputs with correct answers | emails → spam/not-spam; house → price | humans / measurement |
| Unsupervised | inputs only, find structure | cluster customers; reduce dimensions | nothing — it finds patterns |
| Self-supervised | inputs only, but invent a labeled task from them | hide a word, predict it | the data itself |
| Reinforcement (RL) | an environment + a reward signal | game-playing; "was this answer helpful?" | trial, error, reward |
mat. The input and the label both come from the same sentence. Do this billions of times and the model learns grammar, facts, and reasoning as a side effect of getting good at filling blanks. This is pretraining.
Which one do LLMs use? Two of them, in stages. First self-supervised pretraining (next-token prediction over the internet) to learn language and world knowledge. Then post-training to make it a helpful assistant, classically via RLHF (Reinforcement Learning from Human Feedback): humans rank model answers, that trains a reward model, and RL nudges the LLM toward higher-reward (more helpful, less harmful) responses. So a chatbot is self-supervised pretraining plus an RL-style alignment step. You'll go deep on this in /finetuning.
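The "label comes from the data itself" idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: the example sentence is hypothetical, whitespace splitting stands in for a real tokenizer, and next_token_examples is a made-up helper name.

```python
def next_token_examples(text):
    """Turn raw text into (context, target) pairs for next-token prediction."""
    tokens = text.split()  # whitespace split stands in for a real tokenizer
    # Each prefix of the sentence is an input; the token that follows is its label.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_examples("the cat sat on the mat")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

No human labeled anything: one six-word sentence yields five training examples, which is why this approach scales to internet-sized corpora.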
Two distinct phases, and conflating them is a classic beginner error:
x, read off f_θ(x). Every time you call an LLM API, you're doing inference; the parameters don't change.
Think compile-time vs. run-time. Training is the slow build that bakes the artifact; inference is calling the built binary.
To "turn the knobs," you need a number that says how wrong the model currently is. That's the loss function L. For regression, a standard choice is squared error: L(ŷ, y) = (ŷ − y)², where ŷ = f_θ(x) is the prediction ("y-hat") and y is the truth. Squaring makes all errors positive and punishes big misses harder.
Learning is then an optimization problem: find the θ that minimizes average loss over your N training examples:
θ* = argmin_θ (1/N) Σᵢ L( f_θ(xᵢ), yᵢ )
Every symbol in plain words:
- argmin_θ: "pick the θ that makes the following expression smallest."
- Σᵢ: "add up over all training examples i = 1..N."
- (xᵢ, yᵢ): the i-th example's features and label.
- f_θ(xᵢ): the model's prediction for that example.
- L(...): the loss for one example, i.e. how wrong the prediction f_θ(xᵢ) is compared to the true label yᵢ.
- 1/N: dividing the sum by N gives the average loss.

The quantity being minimized, average loss on the training set, is called the training error (a.k.a. empirical risk). How you actually find θ* is gradient descent, the subject of /how-models-learn; here just hold the idea "search for knob settings that make the loss small."
Concrete example with real numbers. Say you're learning to predict house prices. You have N = 3 training examples:
- x₁ = [2000 sqft], y₁ = $300k (true price)
- x₂ = [1500 sqft], y₂ = $250k
- x₃ = [3000 sqft], y₃ = $400k

You pick a linear model f_θ(x) = w·x + b. On the first pass, guess w = 0.1, b = 0. Your predictions (in thousands of dollars) are:
- f_θ(x₁) = 0.1 × 2000 + 0 = 200, so the model says $200k
- f_θ(x₂) = 0.1 × 1500 + 0 = 150, so the model says $150k
- f_θ(x₃) = 0.1 × 3000 + 0 = 300, so the model says $300k

Using squared error loss L(ŷ, y) = (ŷ − y)²:
- (200 − 300)² = 10000
- (150 − 250)² = 10000
- (300 − 400)² = 10000

Average loss (training error) = (10000 + 10000 + 10000) / 3 = 10000. Now adjust w and b (e.g., w = 0.15, b = 10) and recalculate: the predictions become 310, 235, and 460, the squared errors 100, 225, and 3600, and the average loss shrinks to 3925 / 3 ≈ 1308. Learning is repeating this process until no change in θ makes the average loss meaningfully smaller.
What just happened: we measured how far off the model was on average, and that single number guides which direction to tune the knobs.
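The house-price arithmetic above can be checked with a short NumPy sketch. It uses the three training examples and the two parameter guesses from the text; the helper name average_loss is introduced here for illustration.

```python
import numpy as np

# The three training examples from the text (prices in thousands of dollars).
sqft = np.array([2000.0, 1500.0, 3000.0])
price = np.array([300.0, 250.0, 400.0])

def average_loss(w, b):
    """Mean squared error of the linear model f(x) = w*x + b on the training set."""
    predictions = w * sqft + b
    return np.mean((predictions - price) ** 2)

first_guess = average_loss(0.1, 0.0)     # every prediction is off by 100
second_guess = average_loss(0.15, 10.0)  # errors of 10, -15, and 60

print(first_guess)             # 10000.0
print(round(second_guess, 2))  # 1308.33
```

For these three points, w = 0.1 with b = 100 happens to fit exactly, so average_loss(0.1, 100.0) is 0. Real data is noisy and rarely allows a perfect fit, which is where the rest of this lesson picks up.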
Here's the twist. You don't actually care about training error. You care about loss on future, unseen data drawn from the same source — the generalization error. Minimizing training error is only a proxy; a model can ace the training set and be useless on anything new, the way code can pass exactly the tests you wrote and break in production.
To estimate the thing you actually care about, split your data before training:
| Split | Used for | Touches θ? | SWE analogy |
|---|---|---|---|
| Train (~70–80%) | fitting parameters θ | yes | the code you write to pass tests |
| Validation (~10–15%) | tuning hyperparameters, picking the model | indirectly | your dev/staging runs |
| Test (~10–15%) | one final, honest score | no | prod traffic you never trained on |
A hyperparameter is a setting you choose, not something learned — model size, how long to train, regularization strength. You tune those by checking the validation set. The test set is sacred: you look at it once, at the end. The reason for three splits and not two: the moment you use a set to make decisions, you start (subtly) fitting to it. Validation absorbs that contamination so the test set stays an uncontaminated estimate of real-world performance. Reusing the test set to pick models is data leakage, and it's why a "95% accurate" model can faceplant in production. (More in /ml-foundations/evaluation-and-data.)
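One way to produce the three splits is to shuffle the example indices once, before any training, and slice them. The sketch below assumes a hypothetical dataset of 1,000 randomly generated examples and an 80/10/10 split; random shuffling also assumes the examples are independent (time-ordered data is usually split by time instead).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 1,000 examples with 3 features each, plus one label per example.
X = rng.normal(size=(1000, 3))
y = rng.normal(size=1000)

# Shuffle indices once, BEFORE any training, then carve out the three splits.
indices = rng.permutation(len(X))
n_train = int(0.8 * len(X))
n_val = int(0.1 * len(X))

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]  # whatever remains

X_train, y_train = X[train_idx], y[train_idx]  # fit parameters θ here
X_val, y_val = X[val_idx], y[val_idx]          # tune hyperparameters here
X_test, y_test = X[test_idx], y[test_idx]      # score once, at the very end

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```

Because each index lands in exactly one split, no test example can leak into training or tuning.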
Watch the gap between training error and validation error:
This is the bias–variance tradeoff. Conceptually, expected test error decomposes as:
expected test error ≈ bias² + variance + irreducible noise
bias = error from wrong assumptions (too simple); variance = error from over-sensitivity to the particular training sample (too complex); irreducible noise = randomness no model can remove. Make the model more powerful → bias drops but variance rises; simplify it → the reverse. The sweet spot minimizes the sum. (Modern very-large nets complicate this picture, but the intuition is the right starting mental model, and exactly what interviewers want.)
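Bias and variance can be estimated empirically by refitting the same model on many freshly sampled training sets and looking at how the fits behave on average. The sketch below is a simulation under assumed settings (12 noisy samples of sin(x), 200 resampled training sets, polynomial degrees 1, 3, and 7); the function name bias_variance is hypothetical. It measures error against the clean sine curve, so the irreducible-noise term does not appear.

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 2 * np.pi, 200)
y_true = np.sin(x_test)

def bias_variance(degree, n_datasets=200, n_points=12, noise_std=0.3):
    """Estimate bias² and variance of a polynomial fit by refitting on many noisy training sets."""
    x_train = np.linspace(0, 2 * np.pi, n_points)
    predictions = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        # A fresh noisy training set drawn from the same underlying curve.
        y_train = np.sin(x_train) + rng.normal(0, noise_std, size=n_points)
        coeffs = np.polyfit(x_train, y_train, degree)
        predictions[i] = np.polyval(coeffs, x_test)
    mean_prediction = predictions.mean(axis=0)
    bias_sq = np.mean((mean_prediction - y_true) ** 2)  # systematic error of the average fit
    variance = np.mean(predictions.var(axis=0))         # spread of fits across training sets
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 3, 7)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree={degree}  bias^2={bias_sq:.3f}  variance={variance:.3f}")
```

The straight line should show the largest bias² and the smallest variance, with the ordering reversed for the degree-7 fit.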
The cure for overfitting is regularization — anything that discourages the model from memorizing. The intuition: penalize complexity so the model prefers the simplest explanation that fits (Occam's razor, enforced numerically). Common forms: L2/weight decay (add a penalty for large weights, so it can't contort wildly), dropout (randomly silence neurons during training so it can't rely on any single one), early stopping (quit when validation error starts climbing), and the cheapest cure of all — more data. We'll make this concrete in /how-models-learn.
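As one illustration of the L2 idea, the sketch below fits a degree-9 polynomial to 12 noisy samples of sin(x) with and without a weight penalty. The setup, the helper name fit_ridge, and the penalty strength lam = 0.1 are assumptions chosen for the example; for simplicity the penalty also applies to the intercept, which many implementations exclude.

```python
import numpy as np

rng = np.random.default_rng(0)

# 12 noisy samples of sin(x), similar in spirit to the toy problem in this lesson.
x = np.linspace(0, 2 * np.pi, 12)
y = np.sin(x) + rng.normal(0, 0.3, size=x.shape)

# Degree-9 polynomial features. x is rescaled to [-1, 1] to keep the powers well-behaved.
x_scaled = x / np.pi - 1.0
features = np.vander(x_scaled, N=10, increasing=True)  # columns: 1, x, x^2, ..., x^9

def fit_ridge(features, y, lam):
    """Minimize ||features @ w - y||^2 + lam * ||w||^2 via the regularized normal equations."""
    n_features = features.shape[1]
    gram = features.T @ features + lam * np.eye(n_features)
    return np.linalg.solve(gram, features.T @ y)

w_plain = np.linalg.lstsq(features, y, rcond=None)[0]  # no penalty
w_ridge = fit_ridge(features, y, lam=0.1)              # lam is a hyperparameter

print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```

The penalized weights have a smaller norm, at the cost of a slightly worse fit to the training points. How large lam should be is exactly the kind of choice the validation set is for.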
You want to learn y = sin(x), but you only get 12 noisy training points. Fit polynomials of different complexity and compare training vs. test error (test = many clean points). Real output from the code in §4:
| Model | train MSE | test MSE | Verdict |
|---|---|---|---|
| degree-1 line | 0.308 | 0.205 | underfit — too rigid, high error everywhere |
| degree-3 curve | 0.028 | 0.015 | just right — both low |
| degree-11 curve | 0.000 | 0.080 | overfit — perfect on train, worse on test |
The degree-11 polynomial threads exactly through all 12 noisy dots (train MSE = 0.000) — and pays for it, wiggling between them and doing worse on unseen points than the simple degree-3 fit. That single table is the whole lesson: low training error is not the goal; low error on data you haven't seen is.
Training loss keeps falling; validation loss falls then rises as the model memorizes. The gap is overfitting — you stop at the validation minimum (early stopping).
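Early stopping is often implemented with a "patience" counter: keep training while validation loss improves, and stop once it has failed to improve for a few epochs in a row. The sketch below is one simple variant; the function name, the patience value, and the loss values are made up for illustration and are not results from any real run.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, stopping once validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Invented validation losses that fall, then rise (illustrative only).
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(val_losses))  # 3
```

In a real training loop you would also save the parameters at the best epoch, so that stopping late still returns the model from the validation minimum.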
A complete, runnable demonstration of underfit → good → overfit in ~15 lines of NumPy.
```python
import numpy as np

rng = np.random.default_rng(0)

# The TRUE relationship we want to learn. In real life we never get to see this;
# we only see noisy samples of it.
def true_fn(x):
    return np.sin(x)

# Training set: 12 points, each corrupted by random noise (std 0.3).
x_train = np.linspace(0, 2 * np.pi, 12)
y_train = true_fn(x_train) + rng.normal(0, 0.3, size=x_train.shape)

# Test set: 200 CLEAN points the model never trains on: our generalization probe.
x_test = np.linspace(0, 2 * np.pi, 200)
y_test = true_fn(x_test)

def fit_and_score(degree):
    coeffs = np.polyfit(x_train, y_train, degree)     # TRAINING: tune parameters on train data only
    f = np.poly1d(coeffs)                             # the learned function f_θ
    train_mse = np.mean((f(x_train) - y_train) ** 2)  # error on data it has seen
    test_mse = np.mean((f(x_test) - y_test) ** 2)     # error on data it has NOT seen
    return train_mse, test_mse

for degree in (1, 3, 11):
    tr, te = fit_and_score(degree)
    print(f"degree={degree:2d} train_mse={tr:.3f} test_mse={te:.3f}")
```
Line by line: true_fn is the hidden pattern. y_train adds noise, so the model must not trust it too literally. np.polyfit(x, y, degree) is the entire training step: it finds polynomial coefficients (the parameters θ) minimizing squared error on the training data. That's §3.5's argmin in one call. degree is the hyperparameter controlling model flexibility. We then score on train and on a held-out test set. Running it reproduces the table above: as degree climbs, train error marches to zero while test error bottoms out at degree 3 and then rises. That is overfitting, made visible.
Setup. true_fn(x) = sin(x) is the real relationship hiding behind the noise — in real life, we never see this. x_train and y_train are 12 samples from that sine curve, but each corrupted by random noise (std 0.3); the model will only see the noisy version. x_test and y_test are 200 clean points — a test set the model never trains on, used to measure generalization.
The training step. np.polyfit(x_train, y_train, degree) does all the heavy lifting: it fits a polynomial of the given degree to the noisy training data, finding the coefficients (θ) that minimize squared error on those 12 points. f = np.poly1d(coeffs) wraps those coefficients into a callable function.
Scoring. train_mse measures error on the data the model has seen (expected to be low if the fit is good). test_mse measures error on clean, unseen data (the honest report card). As degree increases, train_mse drops because a more flexible polynomial can thread through the noisy training points — but if it memorizes noise, test_mse rises. That rise is overfitting: the model got too specific to the 12 training examples.
Summary: the code demonstrates the core lesson — fitting a model to training data and watching train-vs-test error. When they diverge, the model has overfit.
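In practice the polynomial degree would be chosen with a validation set, never the test set. The sketch below shows that selection loop under assumptions that differ from the demo above: training and validation points are drawn at random locations, each set has 30 points, and candidate degrees run from 1 to 7. The helper names noisy_sine and validation_mse are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(n_points):
    """Draw n_points noisy samples of sin(x) at random locations in [0, 2π]."""
    x = rng.uniform(0, 2 * np.pi, size=n_points)
    return x, np.sin(x) + rng.normal(0, 0.3, size=n_points)

x_train, y_train = noisy_sine(30)  # used to fit coefficients
x_val, y_val = noisy_sine(30)      # used only to choose the degree

def validation_mse(degree):
    """Fit on the training set, score on the validation set."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)

candidate_degrees = range(1, 8)
scores = {d: validation_mse(d) for d in candidate_degrees}
best_degree = min(scores, key=scores.get)  # the hyperparameter with the lowest validation error
print(best_degree, scores[best_degree])
```

Only after best_degree is fixed would you report a single score on an untouched test set.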
x); parameters are the learned knobs (θ) inside the model.
You're ready to move on when you can explain to a fellow engineer, without notes, what features, labels, parameters, training, inference, and generalization are, name the four learning paradigms with an example of each, and describe overfitting plus one cure.
Next: /ml-foundations/math-for-ml — the vectors, matrices, derivatives, and probability you'll lean on everywhere — then /ml-foundations/how-models-learn, where gradient descent actually turns the knobs.