ML Foundations (for engineers)
Prereq · IC3

What Is Machine Learning?

ML is writing a program by showing it examples instead of typing the rules — and the whole game is whether that learned program works on data it has never seen.

13 min read · 16 sections
Prerequisites: none — start here

1. The one-sentence intuition

Machine learning is writing a program by showing it examples instead of typing the rules by hand. You don't hand-code a rule such as "if the email contains 'viagra', mark it as spam"; you collect 100,000 emails labeled spam/not-spam and let an algorithm fit a function that maps email → label. The "code" is a big pile of numbers (the model's parameters) that got tuned to match your examples. As a SWE: it's like having a fuzzy spec and an enormous test suite, and a compiler that searches for any implementation that passes the tests, except that the artifact it produces is a numeric function, not source you'd read.

The catch — and the entire field, really — is this: passing the tests you showed it is easy. The job is making it work on inputs it has never seen.

2. Why a software engineer needs this

Every pillar downstream of here is ML wearing a costume:

  • Transformers / LLMs (/transformers) are one giant learned function. "It's just predicting the next token" only means something once you know what learning a function from data is.
  • Fine-tuning (/finetuning) = continuing the training process on your own examples. You can't reason about LoRA, SFT, or DPO without the train/validate/test and overfitting vocabulary from this lesson.
  • RAG (/rag) leans on embeddings — vectors that come out of a self-supervised model. "Why does cosine similarity find related docs?" is an ML question.
  • Evals (/evals) are literally the test set from this lesson, formalized. Every "our agent scores 82%" claim is a generalization claim.

Interviews silently assume you can say, in plain English, what supervised vs. self-supervised learning is, why we hold out a test set, and what overfitting looks like. Get these wrong and you sound like someone who has only used the API, never understood the machine. This lesson is the map; the rest of /ml-foundations fills in the territory.

3. Build it up from scratch

Beginner explainer: New here? The foundational terms

The words first.

  • Features (x) — the input as numbers (e.g., a house is [sqft, bedrooms, zip], an image is pixel values).
  • Label (y) — the right answer you want the model to predict (a price, a category, the next word).
  • Model (f) — the function shape you choose, like a line or a neural network.
  • Parameters (θ, theta) — the knobs inside the model that training adjusts; learning finds the best θ.
  • Loss — a number that measures how wrong the model's prediction is; lower is better.
  • Generalization — whether the model works on new data it has never seen (the whole point).

Step by step.

  1. You have examples: pairs of (features, label) — emails and spam/not-spam labels.
  2. You pick a model shape (e.g., a line f(x) = w·x + b).
  3. You calculate the loss — how far the model's predictions miss the true labels.
  4. You adjust the knobs (parameters) to make the loss smaller on your training examples.
  5. You test the trained model on new data you never showed it during training.
  6. If it works well on new data, it generalized; if it only works on the training examples, it overfit (memorized them).

Remember this: Machine learning is searching for parameter values that make a function fit your examples well enough that it also works on unseen data.

3.1 Programming vs. learning

| | Traditional programming | Machine learning |
| --- | --- | --- |
| You write | the rules (logic) | the examples (data) + a model shape |
| The computer produces | the output | the rules (parameters), then outputs |
| Good when | rules are knowable & stable | rules are fuzzy, high-dimensional, or unknown |
| Debugging | read the code | inspect the data & errors |

Nobody can write the if/else for "is this photo a cat" or "is this the right next word." There are too many cases and no crisp rule. So instead we pick a flexible function with knobs and turn the knobs until it matches examples. That's it.
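
To make the contrast in the table concrete, here is a toy sketch. The data, the single feature (a count of suspicious words), and the helper names are all hypothetical, invented for illustration; real spam filters use far richer features and models.

```python
# Toy contrast: a hand-written rule vs. a rule learned from labeled examples.
# All data and names here are made up for illustration.

# Hypothetical feature: number of suspicious words in an email.
# Label: 1 = spam, 0 = not spam.
examples = [(0, 0), (1, 0), (2, 0), (3, 1), (5, 1), (8, 1)]

def handwritten_rule(suspicious_words):
    # Traditional programming: a human picks the threshold.
    return 1 if suspicious_words >= 1 else 0

def learn_threshold(data):
    # "Learning": try every candidate threshold and keep the first one
    # that makes the fewest mistakes on the labeled examples.
    best_t, best_errors = None, len(data) + 1
    for t in range(0, 10):
        errors = sum((1 if x >= t else 0) != y for x, y in data)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

threshold = learn_threshold(examples)  # the single learned "parameter"

def learned_rule(suspicious_words):
    return 1 if suspicious_words >= threshold else 0

def accuracy(rule, data):
    return sum(rule(x) == y for x, y in data) / len(data)

print("learned threshold:", threshold)
print("handwritten accuracy:", accuracy(handwritten_rule, examples))
print("learned accuracy:", accuracy(learned_rule, examples))
```

Note that both accuracies here are measured on the same examples used for fitting, so they say nothing yet about unseen emails.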

3.2 The core objects: features, labels, model, parameters

  • Features (x): the input, as numbers. A house is [sqft, bedrooms, zip]; an image is a grid of pixel values; text becomes token IDs. Turning raw stuff into useful numbers is feature engineering (deep learning automates much of it).
  • Label (y): the answer you want — a price, a category, the next word.
  • Model (f): the function shape, e.g. a line, a tree, a neural network.
  • Parameters (θ, "theta"): the knobs inside f that learning adjusts. A linear model f(x) = w·x + b has parameters w (weights) and b (bias); GPT-class models have hundreds of billions.

The model with its knobs set is written f_θ — "f parameterized by θ." Learning = finding good θ.
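
As a rough sketch of how these four objects look in code, assuming a linear model and made-up house data with two features (the parameter values below are arbitrary starting guesses, not learned ones):

```python
import numpy as np

# Features x: one row per house, columns = [sqft, bedrooms] (made-up values).
X = np.array([[2000.0, 3.0],
              [1500.0, 2.0],
              [3000.0, 4.0]])

# Labels y: sale price in thousands of dollars (made-up values).
y = np.array([300.0, 250.0, 400.0])

# Parameters theta = (w, b): arbitrary starting values, nothing learned yet.
w = np.array([0.1, 5.0])   # one weight per feature
b = 0.0                    # bias

def f(X, w, b):
    # The model f_theta: a weighted sum of the features plus a bias.
    return X @ w + b

y_hat = f(X, w, b)         # one prediction per house
print("predictions:", y_hat)
print("number of parameters:", w.size + 1)
```

Features and labels come from the data; only w and b would change during training.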

3.3 The four kinds of learning

The flavors differ only in what kind of examples (supervision signal) you get.

| Type | What you give it | Example | "Label" comes from |
| --- | --- | --- | --- |
| Supervised | inputs with correct answers | emails → spam/not-spam; house → price | humans / measurement |
| Unsupervised | inputs only, find structure | cluster customers; reduce dimensions | nothing; it finds patterns |
| Self-supervised | inputs only, but invent a labeled task from them | hide a word, predict it | the data itself |
| Reinforcement (RL) | an environment + a reward signal | game-playing; "was this answer helpful?" | trial, error, reward |

  • Supervised is the workhorse: classification (discrete label — cat/dog, spam/ham) and regression (continuous label — a price, a temperature). Most "ML 101" is here.
  • Unsupervised finds structure with no answer key: clustering (group similar items, e.g. k-means) and dimensionality reduction (compress features, e.g. PCA).
  • Self-supervised is the trick that powers modern AI. There's no human labeling 10 trillion words. Instead you take raw text and manufacture a supervised problem from it for free: "the cat sat on the ___" → label is mat. The input and the label both come from the same sentence. Do this billions of times and the model learns grammar, facts, and reasoning as a side effect of getting good at filling blanks. This is pretraining.
  • Reinforcement learning has no fixed answer key — only a reward that says how good an outcome was. The model takes actions, gets scored, and shifts toward actions that earn more reward. Great for sequential decisions (games, robotics, agents).
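
The self-supervised trick from the list above can be sketched in a few lines: the labels are carved out of the raw text itself. This is only an illustration; the whitespace split stands in for a real tokenizer, and real pretraining works on token IDs at vastly larger scale.

```python
# Manufacture (input, label) pairs from raw text: the label is simply the next word.
text = "the cat sat on the mat"
tokens = text.split()  # naive whitespace tokenizer, for illustration only

# For each position i, the context is everything before it and the label is token i.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, label in pairs:
    print(" ".join(context), "->", label)
```

No human wrote any of the labels; one six-word sentence yields five supervised training examples.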

Which one do LLMs use? Two of them, in stages. First, self-supervised pretraining (next-token prediction over the internet) to learn language and world knowledge. Then post-training to make it a helpful assistant, classically via RLHF (Reinforcement Learning from Human Feedback): humans rank model answers, those rankings train a reward model, and RL nudges the LLM toward higher-reward (more helpful, less harmful) responses. So a chatbot is self-supervised pretraining plus an RL-style alignment step. You'll go deep on this in /finetuning.

3.4 Training vs. inference

Two distinct phases, and conflating them is a classic beginner error:

  • Training (offline, expensive, once-ish): show examples, adjust θ to reduce error. This is where GPUs burn for weeks.
  • Inference (online, cheap-ish, every request): freeze θ, feed a new x, read off f_θ(x). Every time you call an LLM API, you're doing inference; the parameters don't change.

Think compile-time vs. run-time. Training is the slow build that bakes the artifact; inference is calling the built binary.
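
A minimal sketch of the two phases, assuming a one-feature linear model and three illustrative (sqft, price) points. The predict helper is a hypothetical name, and np.polyfit stands in for whatever training procedure a real system would use.

```python
import numpy as np

# TRAINING (done once, offline): fit parameters to examples.
sqft = np.array([1500.0, 2000.0, 3000.0])
price_k = np.array([250.0, 300.0, 400.0])       # price in thousands of dollars
w, b = np.polyfit(sqft, price_k, 1)             # least-squares line: [slope, intercept]
theta = (w, b)                                  # the frozen artifact you would ship

# INFERENCE (every request): parameters are only read, never updated.
def predict(x, theta):
    w, b = theta
    return w * x + b

print(predict(2500.0, theta))   # a new input the model was not fit on
print(predict(1800.0, theta))   # another request; theta is unchanged
```

Nothing in predict writes to theta, which is the code-level meaning of "the parameters don't change" at inference time.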

3.5 The minimal math: a loss and an objective

To "turn the knobs," you need a number that says how wrong the model currently is. That's the loss function L. For regression, a standard choice is squared error: L(ŷ, y) = (ŷ − y)², where ŷ = f_θ(x) is the prediction ("y-hat") and y is the truth. Squaring makes all errors positive and punishes big misses harder.

Learning is then an optimization problem — find the θ that minimizes average loss over your N training examples:

The learning objective — on real numbers

Every symbol in plain words. The learning objective (written out as a formula right after this worked example) finds the parameter values θ (the knobs) that make the average loss as small as possible across all N training examples. argmin_θ means "pick the θ that makes the following expression smallest." Σᵢ means "add up over all i from 1 to N." L(...) is the loss for one example: how wrong the prediction f_θ(xᵢ) is compared to the true label yᵢ. Dividing by N gives the average loss.

Concrete example with real numbers. Say you're learning to predict house prices. You have N = 3 training examples:

  • Example 1: x₁ = [2000 sqft], y₁ = $300k (true price)
  • Example 2: x₂ = [1500 sqft], y₂ = $250k
  • Example 3: x₃ = [3000 sqft], y₃ = $400k

You pick a linear model f_θ(x) = w·x + b. On the first pass, guess w = 0.1, b = 0. Your predictions are:

  • f_θ(x₁) = 0.1 * 2000 + 0 = 200 (in thousands) — the model says $200k
  • f_θ(x₂) = 0.1 * 1500 + 0 = 150 — the model says $150k
  • f_θ(x₃) = 0.1 * 3000 + 0 = 300 — the model says $300k

Using squared error loss L(ŷ, y) = (ŷ − y)²:

  • Loss for example 1: (200 − 300)² = 10000
  • Loss for example 2: (150 − 250)² = 10000
  • Loss for example 3: (300 − 400)² = 10000

Average loss (training error) = (10000 + 10000 + 10000) / 3 = 10000. Now adjust w and b (e.g., w = 0.15, b = 10) and recalculate: the predictions become 310, 235, and 460, the squared errors 100, 225, and 3600, and the average loss shrinks to about 1308. Learning is repeating this process until no change in θ makes the average loss meaningfully smaller. What just happened: we measured how far off the model was on average, and that single number guides which direction to tune the knobs.

θ* = argmin_θ  (1/N) Σᵢ L( f_θ(xᵢ), yᵢ )

Reading it symbol by symbol: argmin_θ = "the θ that makes the following smallest"; Σᵢ = "sum over training examples i = 1..N"; (xᵢ, yᵢ) = the i-th example's features and label; f_θ(xᵢ) = the model's prediction for it. The quantity being minimized — average loss on the training set — is called the training error (a.k.a. empirical risk). How you actually find θ* is gradient descent, the subject of /how-models-learn; here just hold the idea "search for knob settings that make the loss small."
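
The house-price arithmetic above can be checked with a short script. It is a sketch of evaluating the objective at a few hand-picked values of θ, not of gradient descent; the function name average_loss is an illustrative choice.

```python
import numpy as np

# The three training examples from the worked example (prices in $k).
x = np.array([2000.0, 1500.0, 3000.0])
y = np.array([300.0, 250.0, 400.0])

def average_loss(w, b):
    # (1/N) * sum of squared errors for the linear model f(x) = w*x + b.
    y_hat = w * x + b
    return np.mean((y_hat - y) ** 2)

print(f"{average_loss(0.1, 0.0):.2f}")     # first guess     -> 10000.00
print(f"{average_loss(0.15, 10.0):.2f}")   # adjusted guess  -> 1308.33
print(f"{average_loss(0.1, 100.0):.2f}")   # fits all three  -> 0.00
```

The last line works because these three points happen to lie exactly on the line price = 0.1 · sqft + 100, so that θ is the argmin for this tiny dataset. Real data is noisy and the minimum loss is rarely zero.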

3.6 Generalization is the whole game

Here's the twist. You don't actually care about training error. You care about loss on future, unseen data drawn from the same source — the generalization error. Minimizing training error is only a proxy; a model can ace the training set and be useless on anything new, the way code can pass exactly the tests you wrote and break in production.

To estimate the thing you actually care about, split your data before training:

| Split | Used for | Touches θ? | SWE analogy |
| --- | --- | --- | --- |
| Train (~70–80%) | fitting parameters θ | yes | the code you write to pass tests |
| Validation (~10–15%) | tuning hyperparameters, picking the model | indirectly | your dev/staging runs |
| Test (~10–15%) | one final, honest score | no | prod traffic you never trained on |

A hyperparameter is a setting you choose, not something learned — model size, how long to train, regularization strength. You tune those by checking the validation set. The test set is sacred: you look at it once, at the end. The reason for three splits and not two: the moment you use a set to make decisions, you start (subtly) fitting to it. Validation absorbs that contamination so the test set stays an uncontaminated estimate of real-world performance. Reusing the test set to pick models is data leakage, and it's why a "95% accurate" model can faceplant in production. (More in /ml-foundations/evaluation-and-data.)
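
One common way to produce the three splits is to shuffle the indices once and slice them. The sketch below assumes a hypothetical dataset of 1,000 random examples and an 80/10/10 split; the sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 1,000 examples with 5 features each (random placeholders).
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# Shuffle indices once, BEFORE any training, then carve out the three splits.
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))
n_val = int(0.1 * len(X))

train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]          # whatever remains

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```

A plain random shuffle assumes the examples are independent. Time-ordered or grouped data (several rows per user, say) usually needs a split that respects that structure, or information leaks across the splits.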

3.7 Overfitting, underfitting, and the bias–variance tradeoff

Watch the gap between training error and validation error:

  • Underfitting (high bias): the model is too simple to capture the pattern. Training error is high and validation error is high too. It's a straight line trying to trace a curve. The model has a strong, wrong bias about the shape of the world.
  • Overfitting (high variance): the model is too flexible and memorizes the training examples including their noise. Training error is near zero, but validation error is high. Change a few training points and the fit swings wildly — that sensitivity is variance.
  • Just right: both errors are low and close together.

This is the bias–variance tradeoff. Conceptually, expected test error decomposes as:

expected test error  ≈  bias²  +  variance  +  irreducible noise

bias = error from wrong assumptions (too simple); variance = error from over-sensitivity to the particular training sample (too complex); irreducible noise = randomness no model can remove. Make the model more powerful → bias drops but variance rises; simplify it → the reverse. The sweet spot minimizes the sum. (Modern very-large nets complicate this picture, but the intuition is the right starting mental model — and exactly what interviewers want.)
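
Bias and variance can be estimated empirically by refitting the same model on many freshly noised training sets. The sketch below assumes the sin(x) setup with 12 training points and noise std 0.3; the function name and the 500 trials are illustrative choices. Because the evaluation targets are noise-free, the irreducible-noise term does not appear.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 2 * np.pi, 12)
x_eval = np.linspace(0, 2 * np.pi, 200)
truth = np.sin(x_eval)

def bias2_and_variance(degree, n_trials=500, noise_std=0.3):
    # Refit the same model shape on many independently noised training sets.
    preds = np.empty((n_trials, x_eval.size))
    for t in range(n_trials):
        y_train = np.sin(x_train) + rng.normal(0, noise_std, size=x_train.shape)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coeffs, x_eval)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - truth) ** 2)   # how far the AVERAGE fit is from the truth
    variance = np.mean(preds.var(axis=0))       # how much individual fits scatter around it
    return bias2, variance

results = {d: bias2_and_variance(d) for d in (1, 3, 9)}
for d, (b2, var) in results.items():
    print(f"degree={d}  bias^2={b2:.3f}  variance={var:.3f}")
```

In this setup the straight line should show the largest bias² and the degree-9 polynomial the largest variance, which is the tradeoff in numbers.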

The cure for overfitting is regularization: anything that discourages the model from memorizing. The intuition: penalize complexity so the model prefers the simplest explanation that fits (Occam's razor, enforced numerically). Common forms: L2/weight decay (add a penalty for large weights, so it can't contort wildly), dropout (randomly silence neurons during training so it can't rely on any single one), and early stopping (quit when validation error starts climbing). More training data also reduces overfitting, and is often the most effective cure when you can get it. We'll make this concrete in /how-models-learn.
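
As one illustration of L2 regularization, here is a sketch of ridge regression on polynomial features, solved in closed form. The setup (12 noisy samples of sin(x), degree 9, the chosen penalty strengths) and the helper names are assumptions for this example. For simplicity the penalty is also applied to the constant term, which many libraries avoid.

```python
import numpy as np

rng = np.random.default_rng(0)

# 12 noisy samples of sin(x).
x = np.linspace(0, 2 * np.pi, 12)
y = np.sin(x) + rng.normal(0, 0.3, size=x.shape)

def poly_features(x, degree):
    # Rescale x to [-1, 1] so high powers stay numerically tame,
    # then build columns [1, z, z^2, ..., z^degree].
    z = x / np.pi - 1.0
    return np.vander(z, degree + 1, increasing=True)

def fit_ridge(x, y, degree, lam):
    # Minimize sum((Phi @ w - y)^2) + lam * sum(w^2).
    # Closed form: w = (Phi^T Phi + lam * I)^-1 Phi^T y.
    Phi = poly_features(x, degree)
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

x_test = np.linspace(0, 2 * np.pi, 200)
y_test = np.sin(x_test)

for lam in (0.0, 1e-3, 1e-1):
    w = fit_ridge(x, y, degree=9, lam=lam)
    train_mse = np.mean((poly_features(x, 9) @ w - y) ** 2)
    test_mse = np.mean((poly_features(x_test, 9) @ w - y_test) ** 2)
    print(f"lam={lam:<6} |w|={np.linalg.norm(w):8.2f}  "
          f"train_mse={train_mse:.3f}  test_mse={test_mse:.3f}")
```

Raising lam is guaranteed to shrink the weight norm and cannot lower training error. Whether test error improves depends on the penalty strength and the particular noise draw, which is why lam is a hyperparameter tuned on validation data.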

3.8 Worked micro-example

You want to learn y = sin(x), but you only get 12 noisy training points. Fit polynomials of different complexity and compare training vs. test error (test = many clean points). Real output from the code in §4:

| Model | train MSE | test MSE | Verdict |
| --- | --- | --- | --- |
| degree-1 line | 0.308 | 0.205 | underfit: too rigid, high error everywhere |
| degree-3 curve | 0.028 | 0.015 | just right: both low |
| degree-11 curve | 0.000 | 0.080 | overfit: perfect on train, worse on test |

The degree-11 polynomial threads exactly through all 12 noisy dots (train MSE = 0.000) — and pays for it, wiggling between them and doing worse on unseen points than the simple degree-3 fit. That single table is the whole lesson: low training error is not the goal; low error on data you haven't seen is.

Live illustration: Convergence vs. overfitting

Training loss keeps falling; validation loss falls then rises as the model memorizes. The gap is overfitting — you stop at the validation minimum (early stopping).

4. See it in code

A complete, runnable demonstration of underfit → good → overfit in ~15 lines of NumPy.

import numpy as np
 
rng = np.random.default_rng(0)
 
# The TRUE relationship we want to learn. In real life we never get to see this —
# we only see noisy samples of it.
def true_fn(x):
    return np.sin(x)
 
# Training set: 12 points, each corrupted by random noise (std 0.3).
x_train = np.linspace(0, 2 * np.pi, 12)
y_train = true_fn(x_train) + rng.normal(0, 0.3, size=x_train.shape)
 
# Test set: 200 CLEAN points the model never trains on — our generalization probe.
x_test = np.linspace(0, 2 * np.pi, 200)
y_test = true_fn(x_test)
 
def fit_and_score(degree):
    coeffs = np.polyfit(x_train, y_train, degree)   # TRAINING: tune parameters on train data only
    f = np.poly1d(coeffs)                           # the learned function f_θ
    train_mse = np.mean((f(x_train) - y_train) ** 2)  # error on data it has seen
    test_mse  = np.mean((f(x_test)  - y_test)  ** 2)  # error on data it has NOT seen
    return train_mse, test_mse
 
for degree in (1, 3, 11):
    tr, te = fit_and_score(degree)
    print(f"degree={degree:2d}  train_mse={tr:.3f}  test_mse={te:.3f}")

Line by line: true_fn is the hidden pattern. y_train adds noise — the model must not trust it too literally. np.polyfit(x, y, degree) is the entire training step: it finds polynomial coefficients (the parameters θ) minimizing squared error on the training data — that's §3.5's argmin in one call. degree is the hyperparameter controlling model flexibility. We then score on train and on a held-out test set. Running it reproduces the table above: as degree climbs, train error marches to zero while test error bottoms out at degree 3 and then rises — overfitting, made visible.

What this code does, section by section

Setup. true_fn(x) = sin(x) is the real relationship hiding behind the noise — in real life, we never see this. x_train and y_train are 12 samples from that sine curve, but each corrupted by random noise (std 0.3); the model will only see the noisy version. x_test and y_test are 200 clean points — a test set the model never trains on, used to measure generalization.

The training step. np.polyfit(x_train, y_train, degree) does all the heavy lifting: it fits a polynomial of the given degree to the noisy training data, finding the coefficients (θ) that minimize squared error on those 12 points. f = np.poly1d(coeffs) wraps those coefficients into a callable function.

Scoring. train_mse measures error on the data the model has seen (expected to be low if the fit is good). test_mse measures error on clean, unseen data (the honest report card). As degree increases, train_mse drops because a more flexible polynomial can thread through the noisy training points — but if it memorizes noise, test_mse rises. That rise is overfitting: the model got too specific to the 12 training examples.

Summary: the code demonstrates the core lesson — fitting a model to training data and watching train-vs-test error. When they diverge, the model has overfit.
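
The script above compares degrees by looking at test error directly, which is fine for a demonstration. Following the train/validation/test discipline from §3.6, a sketch of choosing degree on a validation set and touching the test set only once might look like this. The noisy_sin helper, the sample sizes, and the candidate range are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sin(n):
    # Hypothetical data source: noisy samples of sin(x) at random x positions.
    x = rng.uniform(0, 2 * np.pi, size=n)
    return x, np.sin(x) + rng.normal(0, 0.3, size=n)

x_train, y_train = noisy_sin(30)
x_val, y_val = noisy_sin(30)
x_test, y_test = noisy_sin(200)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

candidates = range(1, 10)

# Fit every candidate on TRAIN only.
fits = {d: np.polyfit(x_train, y_train, d) for d in candidates}

# Choose the hyperparameter (degree) on VALIDATION only.
val_scores = {d: mse(fits[d], x_val, y_val) for d in candidates}
best_degree = min(val_scores, key=val_scores.get)

# Report on TEST exactly once, for the chosen model.
print("chosen degree:", best_degree)
print("test MSE:", mse(fits[best_degree], x_test, y_test))
```

Unlike the clean test points in the script above, the test labels here are noisy, so even a perfect model would score roughly the noise variance (0.3² = 0.09) rather than zero.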

5. Mental models & SWE analogies

  • Training ≈ compile time, inference ≈ run time. Training bakes the artifact (slow, once); inference runs it (fast, every request). API calls to an LLM are pure inference — the weights are frozen.
  • The model is a learned, lossy function — and parameters are its "config" baked from data. Not handwritten config; config discovered by fitting examples.
  • The test set is production traffic you're forbidden to peek at. Tune on staging (validation); ship and measure on prod (test). Peeking early = data leakage = a green dashboard that lies.
  • Overfitting ≈ hardcoding to your test cases. A function that special-cases every input in your test suite passes CI and breaks on the first real user. Regularization is the linter that says "stop hardcoding, write the general solution."
  • Loss ≈ a fitness/cost metric the optimizer descends. Lower is better; learning is search for parameters that minimize it. (The gradient — which direction reduces loss — is the next lesson.)

6. Common confusions

  • "More training accuracy is always better." No — past a point it means memorization. Watch the train-vs-validation gap, not training error alone.
  • "AI = ML = deep learning = LLMs." Nested, not equal. AI ⊃ ML ⊃ deep learning (neural nets) ⊃ LLMs. ML is the part of AI where systems learn from data; deep learning and LLMs are successively narrower subsets of ML.
  • "The model keeps learning as I use it." Almost never at inference. A deployed LLM has frozen weights; it doesn't remember your last chat unless you re-feed it as context. Updating weights requires a new training run.
  • "Unsupervised and self-supervised are the same." Both use unlabeled data, but self-supervised manufactures a supervised task (predict the hidden word) and trains on it — that's why LLMs learn so much from raw text.
  • "LLMs are trained with reinforcement learning." Only the alignment stage (RLHF). The heavy lifting is self-supervised next-token prediction; RL is a comparatively thin finishing layer.
  • "Features and parameters are the same thing." Features are the input (x); parameters are the learned knobs (θ) inside the model.
  • "Regularization makes the model more accurate." Not on the training set: it usually raises training error on purpose, trading a bit of fit for better generalization.

7. Check yourself

[Prereq] How is machine learning different from normal programming? In normal programming you write the rules and the computer produces outputs. In ML you provide examples (data) and a model shape, and the computer produces the rules — the parameters — by fitting those examples. Use it when the rules are too fuzzy or high-dimensional to write by hand (vision, language, recommendations).
[Prereq] What's the difference between training and inference? Training is the offline, expensive phase that adjusts the model's parameters to fit data. Inference is using the frozen, trained model to make predictions on new inputs. Every LLM API call is inference; the weights don't change.
[IC3] Why split data into train/validation/test instead of just train/test? Because the instant you use a dataset to make decisions (pick the model, tune hyperparameters), you start fitting to it. Validation absorbs that contamination, leaving the test set as a clean, one-time estimate of real-world (generalization) performance. Reusing test for selection is data leakage and inflates your reported numbers.
[IC3] Define overfitting and underfitting in terms of bias and variance, and give one fix for each. Underfitting = high bias: model too simple, high error on both train and validation → fix by adding capacity/features. Overfitting = high variance: model memorizes training noise, low train error but high validation error → fix with regularization (weight decay, dropout, early stopping) or more data.
[IC3] Which learning paradigm trains an LLM, and in what order? Self-supervised pretraining (next-token prediction over huge text corpora) for knowledge and language, then post-training/alignment, classically RLHF (reinforcement learning from human feedback) to make it helpful and safe. Pretraining is the bulk; RLHF is the finishing layer.

You're ready to move on when you can explain to a fellow engineer — without notes — what features, labels, parameters, training, inference, and generalization are, name the four learning paradigms with an example of each, and describe overfitting plus one cure.

8. Go deeper

  • Stanford CS229 — Machine Learning (cs229.stanford.edu): the canonical supervised/unsupervised treatment; notes 1–2 cover exactly this framing.
  • Stanford CS231n — Image Classification intro (cs231n.github.io/classification): the clearest "data-driven approach vs. writing rules" explanation, plus train/val/test discipline.
  • Dive into Deep Learning — Generalization (d2l.ai): free, runnable chapter on overfitting, underfitting, and model selection.
  • Goodfellow, Bengio & Courville — Deep Learning, Ch. 5 (deeplearningbook.org): the rigorous reference for capacity, bias–variance, and regularization.
  • scikit-learn — Cross-validation (scikit-learn.org): practical, code-first take on why and how we hold data out.

Next: /ml-foundations/math-for-ml — the vectors, matrices, derivatives, and probability you'll lean on everywhere — then /ml-foundations/how-models-learn, where gradient descent actually turns the knobs.
