ML Foundations (for engineers)
Prereq · IC3

Math For ML

The dot product is similarity, a matrix is a function, a gradient is the arrow pointing downhill, and cross-entropy is the loss you minimize — the four ideas that make every model in this atlas legible instead of magic.

12 min read · 11 sections
Prerequisites: What is ML? (start at /ml-foundations/what-is-ml); high-school algebra (you can read a sum and a function)

1. The one-sentence intuition

ML runs on four pieces of math, and each is a thing you already use as a programmer: a vector is a list of numbers, a matrix is a pure function that maps one list to another, a gradient is the compiler-computed arrow telling you which way to nudge your parameters to reduce error, and a probability distribution is just normalized scores that sum to 1. You do not need to do this math by hand — autograd and BLAS do it — but you need to read it, the way you read a stack trace: fluently enough that "we take the dot product of the query and key vectors" lands as an obvious sentence, not a wall.

2. Why a software engineer needs this

Every downstream pillar quietly assumes this vocabulary:

  • RAG is "embed text into vectors, then rank by dot-product similarity." If "dot product" is fuzzy, chunking and embeddings reads as hand-waving.
  • Transformers are matrix multiplies plus softmax, end to end. Attention is literally softmax(QKᵀ/√d)·V — three operations from this lesson.
  • Fine-tuning is gradient descent on a loss. Every word there — gradient, loss, cross-entropy, log-likelihood — is defined below.
  • Evals report perplexity and log-loss; those are the probability section.

Interviewers won't ask you to prove a theorem. They'll say "why cosine similarity?" or "what does loss.backward() actually compute?" and expect a crisp, correct answer in 30 seconds. This lesson gives you exactly those answers and nothing you won't use.

3. Build it up from scratch

Three buckets: linear algebra (how data and models are represented), calculus (how models learn), probability (how we score predictions). We teach the real techniques — just the slice each one actually contributes.

Beginner explainer: New here?

The words first.

  • Vector — an ordered list of numbers representing one data point (like an embedding of 1024 values).
  • Matrix — a 2-D grid of learned numbers that transforms vectors; think of it as a learned function.
  • Dot product — multiply aligned elements and sum them, measuring how similarly two vectors point.
  • Gradient — the direction of steepest uphill; step the opposite way to reduce error.
  • Softmax — turn raw scores into probabilities that sum to 1.
  • Cross-entropy — a loss function that penalizes wrong and unconfident predictions.

Step by step.

  1. Read a vector as a point in space — a 3-element vector [2, 1, 3] is a point at coordinates (2, 1, 3).
  2. Multiply two vectors element-by-element, then sum — that's the dot product: [2, 1] · [3, 4] = 2·3 + 1·4 = 10.
  3. Feed a vector through a matrix — it's matrix-multiply: each output is one dot product of a row with the input.
  4. Softmax the outputs — exponentiate each, divide by the sum; now they're positive and sum to 1.
  5. Compare to truth with cross-entropy: −log(prob of right answer) is your loss; minimize it with gradient descent.
  6. Gradient descent is automatic: .backward() computes the gradient via the chain rule; update parameters in the opposite direction.

Remember this: everything in deep learning is matrix multiply → softmax → cross-entropy, repeated and nested — and calculus tells you how to tweak the parameters to make error shrink.

3.1 Linear algebra — data and models are arrays

A vector is an ordered list of numbers, written x = [x₁, x₂, …, xₙ]. The subscript is just an index, like x[i]. Geometrically it's a point (or an arrow from the origin) in n-dimensional space. In ML, everything becomes a vector: an embedding is a vector of (say) 1024 numbers that represents the meaning of a word or document, learned so that similar meanings land at nearby points.

The dot product is the one operation to internalize. For two vectors a and b of the same length:

$$a \cdot b = a_1 b_1 + a_2 b_2 + \dots + a_n b_n = \sum_{i} a_i b_i$$
Dot product — on real numbers

What each symbol means: a and b are two vectors (lists of numbers); aᵢ and bᵢ are their individual elements; the Σ (sum) symbol means add up all the products aᵢ·bᵢ.

A concrete example: let a = [1, 2, 3] and b = [2, 0, 1]. Align them:

  • Position 1: 1 · 2 = 2
  • Position 2: 2 · 0 = 0
  • Position 3: 3 · 1 = 3
  • Sum: 2 + 0 + 3 = 5

So a · b = 5. If the two vectors pointed the exact same direction and had the same length, you'd get a large positive number; if they were perpendicular (unrelated), you'd get zero. The dot product is literally "how much do they align?"

What just happened: we turned two vectors into a single similarity score. That's the entire trick of vector search and attention — embed the query and the document, dot them, and the bigger the number, the more relevant they are.

The Σ (capital sigma) means "sum over all i": it's a for loop that adds up the elementwise products. Example: a = [1, 2, 3], b = [2, 0, 1] → a·b = 1·2 + 2·0 + 3·1 = 5.

Why it matters: the dot product measures alignment. It equals |a| |b| cos θ, where |a| is the length (or norm) of a — its magnitude, |a| = √(Σ aᵢ²) — and θ is the angle between the two vectors. So:

  • vectors pointing the same way → large positive dot product,
  • perpendicular (unrelated) → zero,
  • opposite → negative.

Strip out the lengths and you get cosine similarity = a·b / (|a||b|), a pure "how aligned are these directions" score in [-1, 1]. That single formula is how vector search finds the chunk most relevant to your query: embed both, take the cosine, sort. Same operation is the heart of attention — "how much should token A attend to token B" is a dot product of their query/key vectors.
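
The dot product, the norm, and the cosine ranking described above can be sketched in a few lines of plain Python. The three-dimensional "embeddings" and the document names below are made up for illustration; real embeddings have hundreds or thousands of dimensions and come from a trained model.

```python
import math

def dot(a, b):
    """Sum of elementwise products: the sigma in the formula is this loop."""
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    """Length of a vector: square root of the sum of squares."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Dot product with the lengths divided out; lies in [-1, 1]."""
    return dot(a, b) / (norm(a) * norm(b))

# Worked example from the text.
a = [1, 2, 3]
b = [2, 0, 1]
print(dot(a, b))  # 5

# Hypothetical toy embeddings, invented for this sketch.
query = [1.0, 0.0, 1.0]
docs = {
    "doc_perpendicular": [0.0, 1.0, 0.0],
    "doc_opposite": [-1.0, 0.0, -1.0],
    "doc_same_direction": [2.0, 0.0, 2.0],
}

# Rank documents by cosine similarity to the query, most similar first.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)  # ['doc_same_direction', 'doc_perpendicular', 'doc_opposite']
```

Note that doc_same_direction is twice as long as the query yet still scores about 1.0: cosine similarity ignores length and compares direction only.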

A matrix is a 2-D grid of numbers, W with entry Wᵢⱼ at row i, column j. The right mental model isn't "spreadsheet" — it's a function. A matrix is a linear transformation: feed it a vector, get back a vector, via matrix–vector multiply:

$$y = W x, \qquad y_i = \sum_j W_{ij}\, x_j$$
Matrix-vector multiply — on real numbers

What each symbol means: W is a matrix (rows and columns of learned numbers); x is an input vector; y is the output vector; Wᵢⱼ is the element in row i, column j; yᵢ = Σⱼ Wᵢⱼ·xⱼ says "output element i is the dot product of row i of W with the whole input x."

A concrete example: let x = [1, 2] and W = [[1, 0], [2, 3], [0, 1]] (3 rows, 2 columns — 3 classes or outputs).

  • Output 1: [1, 0] · [1, 2] = 1·1 + 0·2 = 1
  • Output 2: [2, 3] · [1, 2] = 2·1 + 3·2 = 8
  • Output 3: [0, 1] · [1, 2] = 0·1 + 1·2 = 2

So y = [1, 8, 2]. Each row of W is a learned "detector" — it looks for a specific pattern in the input.

What just happened: one neural network layer. The matrix W is the learned weights; each output is asking "how much does this input match this learned pattern?" Stack ten thousand of these, and you have a deep network.

Read that as: output element i is the dot product of row i of W with the input x. A neural network layer is exactly this — y = Wx + b (the +b shifts the result, like an intercept) — so a layer is "apply this learned linear function, then a nonlinearity." The numbers in W are the learned parameters; training is the search for good ones.
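
As a minimal sketch of that reading, here is the matrix-vector multiply written as one dot product per row, using the W and x from the worked example. The bias vector b is a hypothetical value chosen only to show the shift, and the nonlinearity is left out.

```python
def matvec(W, x):
    """y_i = sum_j W[i][j] * x[j]: one dot product per row of W."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def linear_layer(W, b, x):
    """y = Wx + b, the linear part of a layer (no nonlinearity applied here)."""
    return [y_i + b_i for y_i, b_i in zip(matvec(W, x), b)]

W = [[1, 0],
     [2, 3],
     [0, 1]]
x = [1, 2]
print(matvec(W, x))  # [1, 8, 2]

b = [0.5, -1.0, 0.0]  # hypothetical bias, for illustration only
print(linear_layer(W, b, x))  # [1.5, 7.0, 2.0]
```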

Matrix–matrix multiply stacks this: (AB)ᵢⱼ = Σₖ Aᵢₖ Bₖⱼ — entry (i,j) is the dot product of row i of A with column j of B. This is the compute-bottleneck operation of all of deep learning; it's what GPUs are built to do fast and what "FLOPs" mostly counts. When someone says a model is "a stack of matmuls," they mean it almost literally.
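
The index formula translates directly into nested loops. This is a readable sketch, not how a real BLAS or GPU kernel is written; the matrices A and B are arbitrary small examples.

```python
def matmul(A, B):
    """(AB)[i][j] = sum_k A[i][k] * B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "inner dimensions must match"
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]

# Each output entry costs m multiplies and about m adds, so by the usual
# convention an (n x m) times (m x p) product is roughly 2*n*m*p FLOPs.
n, m, p = 2, 2, 2
print(2 * n * m * p)  # 16
```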

3.2 Calculus — how a model improves

Training = "tweak the parameters to make the error smaller." Calculus tells you which way to tweak.

A derivative is a slope. For a one-input function f(x), the derivative f'(x) (also written df/dx) answers: if I nudge x up a tiny bit, how fast does f change, and in which direction? Positive slope → f rises as x rises; negative → f falls. Formally it's the limit of rise-over-run as the step shrinks to zero, but the operational meaning is just local sensitivity: Δf ≈ f'(x)·Δx.
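
One way to make "local sensitivity" concrete is to estimate the slope numerically. The sketch below uses a central difference on f(x) = x², an example function chosen here because its true derivative, 2x, is easy to check; the step size h is an arbitrary small value.

```python
def numerical_derivative(f, x, h=1e-6):
    """Central difference: (f(x+h) - f(x-h)) / (2h) approximates f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    return x ** 2

slope = numerical_derivative(f, 3.0)
print(round(slope, 4))  # 6.0, matching the exact derivative 2x at x = 3

# Local sensitivity: a small nudge dx changes f by about slope * dx.
dx = 0.01
actual_change = f(3.0 + dx) - f(3.0)
predicted_change = slope * dx
print(actual_change, predicted_change)  # both close to 0.06
```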

Models have millions of inputs (parameters), not one. The gradient generalizes the derivative to many variables. For f(x₁,…,xₙ), the gradient is the vector of partial derivatives, where each ∂f/∂xᵢ is "the slope in the xᵢ direction, holding the others fixed" (the curly ∂ just signals 'partial'):

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right)$$

∇ ("nabla") is the gradient operator. The key fact: the gradient points in the direction of steepest increase of f. So to decrease f (your error), step in the opposite direction. That's gradient descent:

$$x \leftarrow x - \eta\, \nabla f$$
Gradient descent — on real numbers

What each symbol means: x is a parameter (or a vector of parameters); ∇f is the gradient (the direction of steepest increase); η (eta) is the learning rate (a small positive number like 0.01); ← means "replace with."

A concrete example: suppose f(x) = x² (a simple bowl-shaped loss) and you start at x = 3.

  • Gradient at x = 3: ∇f = df/dx = 2x = 6 (the slope is 6; the function rises as x increases).
  • Learning rate: η = 0.1.
  • Update: x ← 3 − 0.1·6 = 3 − 0.6 = 2.4.
  • Check: f(3) = 9, f(2.4) = 5.76. Loss fell.
  • Next: at x = 2.4, ∇f = 4.8, so x ← 2.4 − 0.1·4.8 = 1.92.

Repeated, you slide toward x = 0 (the minimum).

What just happened: the gradient is the compiler/autograd telling you "if you nudge x up, f rises this fast." So you nudge the opposite direction, and the error shrinks. Do this for billions of weights, and you train a model.

η ("eta") is the learning rate, a small positive step size you choose (e.g. 0.01). The arrow ← means "update in place." Repeat until the error stops dropping. That one line is how every model in this atlas is trained.

Worked micro-example. Let f(x,y) = x² + y² (a bowl; minimum at the origin). Its gradient is ∇f = (2x, 2y). At the point (3, 4), ∇f = (6, 8). Step downhill with η = 0.1: new point = (3, 4) − 0.1·(6, 8) = (2.4, 3.2). Check: f went from 9+16 = 25 down to 5.76+10.24 = 16. One step, error fell. Iterate and you slide to the bottom.
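
The micro-example can be run as a loop. This is a sketch with the gradient written out by hand; the step count of 25 is an arbitrary choice, and a real training loop would get the gradient from autograd instead.

```python
def f(x, y):
    return x ** 2 + y ** 2

def grad_f(x, y):
    """Gradient of f, derived by hand: (2x, 2y)."""
    return (2 * x, 2 * y)

def gradient_descent(start, eta, steps):
    """Repeat the update (x, y) <- (x, y) - eta * grad f, recording each point."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - eta * gx, y - eta * gy
        path.append((x, y))
    return path

path = gradient_descent(start=(3.0, 4.0), eta=0.1, steps=25)
first = path[1]
print(first)         # approximately (2.4, 3.2)
print(f(*first))     # approximately 16.0
print(f(*path[-1]))  # close to 0 after 25 steps
```

With η = 0.1 each step multiplies both coordinates by 0.8, so the loss shrinks by a factor of 0.64 per step.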

[Interactive: Gradient descent, feel the learning rate]

The chain rule is the engine that makes this work for deep networks. A network is functions nested inside functions: loss(layer3(layer2(layer1(x)))). The chain rule says the derivative of a composition is the product of the local derivatives:

$$\frac{dz}{dx} = \frac{dz}{dy}\cdot\frac{dy}{dx} \quad\text{where } z = f(y),\; y = g(x)$$

Concretely: z = (3x+1)². Let u = 3x+1, so z = u². Then dz/dx = (dz/du)(du/dx) = (2u)(3) = 6(3x+1). At x=1: 6·4 = 24.

Backpropagation ("backprop") is just the chain rule applied right-to-left across the whole network: do a forward pass to compute the loss, then walk backward multiplying local derivatives to get ∂loss/∂w for every parameter w at once. It's reverse-mode automatic differentiation. You never write it by hand — loss.backward() does it — but now you know exactly what that call computes: the gradient, via the chain rule.
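
To see what that backward walk looks like, here is the z = (3x+1)² example with the forward and backward passes written out by hand. It sketches the bookkeeping an autograd engine automates for this one function; it is not a description of how PyTorch is implemented internally.

```python
def forward(x):
    """Forward pass: compute z = (3x + 1)^2, keeping the intermediate u."""
    u = 3 * x + 1
    z = u ** 2
    return z, u

def backward(u):
    """Backward pass: multiply local derivatives from the output back to x."""
    dz_dz = 1.0            # seed: derivative of z with respect to itself
    dz_du = dz_dz * 2 * u  # local derivative of u^2 is 2u
    dz_dx = dz_du * 3      # local derivative of 3x + 1 is 3
    return dz_dx

z, u = forward(1.0)
print(z)            # 16.0
print(backward(u))  # 24.0
```

The backward pass reuses u from the forward pass, which is why frameworks keep intermediate values in memory until the gradient has been computed.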

3.3 Probability — scoring predictions

Models output scores; probability turns scores into calibrated predictions and gives us something principled to minimize.

A distribution assigns a probability to each possible outcome, with the rule that the probabilities are non-negative and sum to 1. "70% cat, 20% dog, 10% bird" is a distribution over 3 classes.

Softmax is how a model produces one. The network's raw output scores are called logits — unbounded real numbers, e.g. [1.0, 3.0, 1.0]. Softmax exponentiates and normalizes them into a distribution:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
Softmax — on real numbers

What each symbol means: z is a vector of logits (raw, unbounded scores); e is the exponential (about 2.718); softmax(z)ᵢ is the i-th output (a probability for class i); Σⱼ sums over all classes.

A concrete example: logits z = [1.0, 3.0, 1.0] for three classes.

  • Exponentials: e^1.0 ≈ 2.72, e^3.0 ≈ 20.09, e^1.0 ≈ 2.72.
  • Sum: 2.72 + 20.09 + 2.72 ≈ 25.53.
  • Probabilities:
    • Class 0: 2.72 / 25.53 ≈ 0.107
    • Class 1: 20.09 / 25.53 ≈ 0.787
    • Class 2: 2.72 / 25.53 ≈ 0.107

Result: [0.107, 0.787, 0.107] — all non-negative, summing to 1, and class 1 (the highest logit) gets the highest probability.

What just happened: we normalized raw scores into a true probability distribution. The higher logit gets amplified (exponential does that), and the denominator ensures they sum to 1. Now you can apply cross-entropy loss.

e^{z} is the exponential (always positive, so no negative "probabilities"); the denominator forces the outputs to sum to 1. For logits [1, 3, 1] you get [0.107, 0.787, 0.107]. Bigger logit → bigger share, and the gap is amplified.
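
A minimal softmax in plain Python is sketched below. It subtracts the largest logit before exponentiating, a standard numerical-stability trick that the text above does not cover: the shift cancels in the ratio, so the probabilities are unchanged, but exp() can no longer overflow.

```python
import math

def softmax(logits):
    """Exponentiate and normalize, shifting by the max logit for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 3.0, 1.0])
print([round(p, 3) for p in probs])  # [0.107, 0.787, 0.107]
print(sum(probs))                    # 1.0 up to floating-point rounding

# A naive math.exp(1000.0) would raise OverflowError; the shifted version is fine.
big = softmax([1000.0, 1002.0, 1000.0])
print([round(p, 3) for p in big])    # [0.107, 0.787, 0.107]
```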

Expectation is a probability-weighted average: E[X] = Σ x·p(x). It's the value you'd get on average over many draws; it shows up whenever we average a loss over data or over a model's outputs.
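
As a quick illustration, the formula is a single weighted sum. The fair die below is an example chosen here, not one from the lesson.

```python
def expectation(values, probs):
    """E[X] = sum of x * p(x) over all outcomes."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(x * p for x, p in zip(values, probs))

# A fair six-sided die: every face has probability 1/6.
die = expectation([1, 2, 3, 4, 5, 6], [1 / 6] * 6)
print(die)  # 3.5 up to floating-point rounding
```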

Conditional probability is P(A|B) — "probability of A given B is known" — defined as P(A,B)/P(B). A language model is one giant conditional: P(next token | all previous tokens). Bayes' rule flips a conditional around:

$$P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}$$

Read H as a hypothesis (e.g. "this email is spam") and D as observed data (the words). It says: posterior belief = (how well the hypothesis explains the data) × (prior belief), normalized. It's the backbone of probabilistic reasoning and the namesake of the Naive Bayes classifier.
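
Here is the spam reading of Bayes' rule as a small sketch. All three input numbers are hypothetical, picked only to make the arithmetic easy to follow, and P(D) is expanded with the law of total probability over "spam" and "not spam".

```python
def posterior(prior_h, p_d_given_h, p_d_given_not_h):
    """Bayes' rule for a yes/no hypothesis H and observed data D."""
    # P(D) = P(D|H) P(H) + P(D|not H) P(not H)
    p_d = p_d_given_h * prior_h + p_d_given_not_h * (1 - prior_h)
    return p_d_given_h * prior_h / p_d

# Hypothetical numbers: 20% of mail is spam; a given word appears in
# 60% of spam and in 5% of legitimate mail.
p_spam_given_word = posterior(prior_h=0.2, p_d_given_h=0.6, p_d_given_not_h=0.05)
print(p_spam_given_word)  # about 0.75
```

Seeing the word moves the belief from the 20% prior to roughly 75%, because the word is twelve times more likely under the spam hypothesis.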

Why log-likelihood and cross-entropy? Training a probabilistic model means maximum likelihood estimation (MLE): pick parameters that make the observed data most probable. The probability of the whole dataset (assuming independent examples) is a giant product ∏ᵢ p(xᵢ; θ). Products of thousands of small numbers underflow to zero and are painful to differentiate. So we take the log: log ∏ = Σ log, turning the product into a sum. Log is monotonic, so the maximizer is unchanged — but the math becomes a clean, stable sum of log p. We then flip the sign (optimizers minimize), giving negative log-likelihood, the thing we descend on.
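
The underflow problem is easy to reproduce. In this sketch the per-example probabilities are a made-up constant 0.01 repeated 1000 times; the direct product collapses to zero in floating point, while the sum of logs stays an ordinary number.

```python
import math

probs = [0.01] * 1000  # hypothetical per-example probabilities

likelihood = math.prod(probs)
print(likelihood)      # 0.0: the true value, 1e-2000, is below float range

log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)  # about -4605.17

nll = -log_likelihood  # negative log-likelihood: the quantity to minimize
print(nll)             # about 4605.17
```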

For classification this is exactly cross-entropy loss. With a true label that's "100% the correct class" (a one-hot target) and model probabilities q, cross-entropy collapses to:

$$\text{loss} = -\log q(\text{correct class})$$

Worked example. True class = "cat," model says cat 0.7 → loss = −log(0.7) ≈ 0.357. A different run says cat 0.1 → loss = −log(0.1) ≈ 2.303. Confident and right → tiny loss; confident and wrong → big loss. That's why we minimize cross-entropy instead of accuracy: it's smooth and differentiable (so gradient descent has a slope to follow) and it rewards calibrated confidence, whereas accuracy is a flat step function with no gradient to learn from.
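
The collapse can be checked against the general definition of cross-entropy, −Σᵢ pᵢ log qᵢ, where p is the target distribution and q the model's. In this sketch the probabilities for "dog" and "bird" are invented to complete each distribution; with a one-hot target only the "cat" term survives.

```python
import math

def cross_entropy(p_true, q_model):
    """General form: -sum_i p_i * log(q_i). Terms with p_i == 0 contribute nothing."""
    return -sum(p * math.log(q) for p, q in zip(p_true, q_model) if p > 0)

# Class order: cat, dog, bird. The target is one-hot on "cat".
one_hot_cat = [1.0, 0.0, 0.0]

good = [0.7, 0.2, 0.1]  # cat at 0.7 (dog and bird values are made up)
bad = [0.1, 0.6, 0.3]   # cat at 0.1 (dog and bird values are made up)

print(round(cross_entropy(one_hot_cat, good), 3))  # 0.357
print(round(cross_entropy(one_hot_cat, bad), 3))   # 2.303
```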

4. See it in code

A full classifier forward pass — matmul → softmax → cross-entropy — is just the math above:

import numpy as np
 
x = np.array([2.0, 1.0, 1.0])                 # input feature vector (length 3)
W = np.array([[ 1.0, -1.0,  0.0],             # learned weights: 3 classes (rows)
              [ 0.0,  2.0,  1.0],             #   x 3 features (cols)
              [ 1.0,  0.0, -1.0]])
 
logits = W @ x                                # matrix-vector multiply -> [1., 3., 1.]
                                              # each entry = dot(row_of_W, x)
 
probs = np.exp(logits) / np.exp(logits).sum() # softmax -> [0.107, 0.787, 0.107]
 
true_class = 1                                # the correct label is class #1
loss = -np.log(probs[true_class])             # cross-entropy = -log(0.787) ≈ 0.24
print(logits, probs, loss)

Line by line: W @ x (the @ is matmul) produces the logits, one dot product per class. np.exp(...)/...sum() is softmax, normalizing logits into a probability distribution. -np.log(probs[true_class]) is cross-entropy: because the model put 0.787 on the right class, the loss is low (≈0.24). Drop that to 0.1 and the loss jumps to ≈2.3 — the penalty for confident wrongness.

Classifier code, section by section

Section 1 (the x and W definitions): Set up input and weights.

We have an input vector x = [2.0, 1.0, 1.0] (three features, like pixels or extracted info) and a weight matrix W (3 rows for 3 classes, 3 columns for 3 features). Each row is a learned pattern detector.

Section 2 (the logits line): Compute logits via matrix multiply.

W @ x multiplies each row of W by the input:

  • Row 0: 1.0·2 + (−1.0)·1 + 0·1 = 1.0
  • Row 1: 0·2 + 2.0·1 + 1.0·1 = 3.0
  • Row 2: 1.0·2 + 0·1 + (−1.0)·1 = 1.0

Logits = [1.0, 3.0, 1.0] — raw, unbounded scores for the three classes.

Section 3 (the probs line): Apply softmax.

np.exp(logits) / np.exp(logits).sum() exponentiates and normalizes: you get [0.107, 0.787, 0.107], so class 1 has 78.7% confidence and classes 0 and 2 split the rest.

Section 4 (the true_class and loss lines): Compute loss.

The true class is 1. Cross-entropy = −log(0.787) ≈ 0.24, a low loss because we were confident and right. The −log is the penalty: had the model put only 0.1 on the correct class, the loss would be −log(0.1) ≈ 2.3, a large penalty for confident wrongness.

What this code does: a forward pass through a 3-way classifier. The model predicted [0.107, 0.787, 0.107] and the loss is 0.24. To learn, you would compute the gradient of that loss with respect to W (backpropagation) and then step W against it (gradient descent) to make the loss smaller.
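
One such update can be sketched without any framework. The code below reuses the same x, W, and true class in plain Python lists and relies on a standard result not derived in this lesson: for softmax followed by cross-entropy, the gradient of the loss with respect to each logit is probs[i] minus the one-hot target. The learning rate of 0.1 is an arbitrary choice.

```python
import math

def forward(W, x, true_class):
    """Matmul -> softmax -> cross-entropy, returning probabilities and loss."""
    logits = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs, -math.log(probs[true_class])

x = [2.0, 1.0, 1.0]
W = [[1.0, -1.0, 0.0],
     [0.0, 2.0, 1.0],
     [1.0, 0.0, -1.0]]
true_class = 1
eta = 0.1  # learning rate, chosen arbitrarily for this sketch

probs, loss_before = forward(W, x, true_class)

# Softmax + cross-entropy: d(loss)/d(logit_i) = probs[i] - onehot[i].
d_logits = [p - (1.0 if i == true_class else 0.0) for i, p in enumerate(probs)]

# logit_i = sum_j W[i][j] * x[j], so d(loss)/dW[i][j] = d_logits[i] * x[j].
grad_W = [[d * xj for xj in x] for d in d_logits]

# One gradient-descent step: W <- W - eta * grad_W.
W_new = [[w - eta * g for w, g in zip(w_row, g_row)]
         for w_row, g_row in zip(W, grad_W)]

_, loss_after = forward(W_new, x, true_class)
print(round(loss_before, 2))     # 0.24
print(loss_after < loss_before)  # True
```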

And the calculus half — the gradient via the chain rule — is one line with autograd:

import torch
 
x = torch.tensor(1.0, requires_grad=True)  # track gradients for x
z = (3 * x + 1) ** 2                        # forward pass builds the graph
z.backward()                                # backprop: chain rule, fills x.grad
print(x.grad)                               # tensor(24.) == 6*(3*1+1), as we derived

z.backward() walks the computation graph backward, multiplying local derivatives (dz/du = 2u, du/dx = 3) to land on dz/dx = 24, exactly our hand-derived chain-rule answer. Scale this from one variable to a billion and you have how every model trains.

Backprop code, section by section

Section 1 (the x = torch.tensor(...) line): Create a tracked variable.

x = torch.tensor(1.0, requires_grad=True) creates x = 1.0, and tells PyTorch "remember every operation on x so we can differentiate backward."

Section 2 (the z = ... line): Forward pass.

z = (3*x + 1)**2 builds a computation graph: z depends on x through three operations (multiply by 3, add 1, then square). PyTorch tracks this chain.

Section 3 (the z.backward() line): Backward pass (backpropagation).

z.backward() applies the chain rule in reverse:

  • Start at z with "derivative of z with respect to itself" = 1.
  • Work backward: dz/du = d(u²)/du = 2u = 2(3·1+1) = 8 (where u = 3x+1).
  • Continue: du/dx = 3.
  • Chain: dz/dx = 8 · 3 = 24.

Section 4 (the print line): Print the result.

x.grad is now 24.0 — the derivative we hand-calculated earlier. That's the gradient: "if you nudge x up by a tiny amount, z goes up by about 24 times that amount."

What this code does: automatic differentiation. You write the forward pass, call .backward(), and PyTorch fills in every gradient. Scale this from 1 variable to 175 billion (like GPT-3), and you have how modern deep learning works: the chain rule automated and run on GPUs.

5. Mental models & SWE analogies

ML idea | What it really is, in SWE terms
--- | ---
Vector / embedding | A fixed-length array; the model's "hash" of meaning, where distance encodes similarity instead of being random
Matrix | A pure function Array → Array; the weights are its (learned) source code
Matrix multiply | The hot loop / inner kernel: what profiling would flag, and what GPUs are built to do fast
Gradient | The compiler telling you the direction of steepest increase; you step the other way to optimize
Backprop (chain rule) | Reverse-mode autodiff = dynamic programming over the call graph: compute every parameter's blame in one backward sweep
Softmax | normalize() for scores: squashes arbitrary reals into probabilities that sum to 1
Cross-entropy loss | The test assertion you minimize: smooth, differentiable, and harsher the more confidently wrong you are

6. Common confusions

  • "A matrix is a table of data." No — treat it as a function. The data flowing through is the vector; the matrix is the operation applied to it.
  • "Dot product = element-wise multiply." Element-wise multiply keeps a vector; the dot product sums those products down to a single number (the similarity score).
  • "The gradient points toward the minimum." It points toward the steepest increase. Gradient descent moves in the negative gradient direction.
  • "Logits are probabilities." Logits are raw, unbounded scores. Only after softmax do they become probabilities in [0,1] that sum to 1.
  • "Backprop is a separate algorithm from the chain rule." Backprop is the chain rule, applied systematically from the loss backward through the network.
  • "We take logs to make numbers smaller." We take logs to turn an underflow-prone product into a stable sum — and because log is monotonic, the optimum is unchanged.
  • "Bigger learning rate = faster training, always." Too large η overshoots and diverges; too small crawls. It's a tuned knob, not "more is better."
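
The learning-rate point in the last item can be checked on the one-variable bowl f(x) = x² used earlier. This is a sketch: the three η values and the 20-step budget are arbitrary choices meant to show crawling, converging, and diverging behavior.

```python
def descend(eta, start=3.0, steps=20):
    """Gradient descent on f(x) = x^2, whose derivative is 2x."""
    x = start
    for _ in range(steps):
        x = x - eta * 2 * x  # each step multiplies x by (1 - 2 * eta)
    return x

for eta in (0.01, 0.1, 1.1):
    print(eta, descend(eta))
# eta = 0.01 crawls: x is still about 2.0 after 20 steps
# eta = 0.1 converges: x is about 0.035
# eta = 1.1 diverges: |x| has grown past 100
```

For this function any η above 1.0 makes |1 − 2η| exceed 1, so every step overshoots the minimum by more than it started with.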

7. Check yourself

[Prereq] What does the dot product measure, and why is it the core of vector search? It measures alignment between two vectors (a·b = |a||b|cosθ): large positive when they point the same way, zero when perpendicular, negative when opposed. Embeddings place similar meanings at nearby directions, so the dot product (or its normalized form, cosine similarity) ranks how relevant a document is to a query — exactly what retrieval needs.
[Prereq] What is a gradient and how does gradient descent use it? The gradient is the vector of partial derivatives of the loss with respect to each parameter; it points in the direction of steepest increase. Gradient descent takes a small step in the opposite direction (w ← w − η∇L) to reduce the loss, repeating until it stops improving.
[IC3] Why minimize cross-entropy instead of accuracy? Cross-entropy (−log q(correct class)) is smooth and differentiable, so gradient descent has a slope to follow; it also rewards calibrated confidence — penalizing confidently-wrong predictions heavily. Accuracy is a flat step function with zero gradient almost everywhere, so there's nothing for the optimizer to descend.
[IC3] What does loss.backward() compute, mechanically? It runs backpropagation: starting from the scalar loss, it walks the computation graph in reverse, applying the chain rule to multiply local derivatives and accumulate ∂loss/∂w for every parameter in a single pass — reverse-mode automatic differentiation.
[Prereq] What's the difference between logits and a probability distribution? Logits are the model's raw, unbounded scores. Softmax exponentiates and normalizes them into a distribution: all non-negative and summing to 1.

You're ready to move on when you can read "softmax(QKᵀ/√d)·V" and "minimize the negative log-likelihood via gradient descent" and narrate, in plain English, what each piece does.

8. Go deeper

  • Stanford CS229 — Linear Algebra review and Probability review: the exact two handouts ML courses assume; skimmable in an evening.
  • 3Blue1Brown — Essence of Linear Algebra: the canonical visual intuition for "matrix = transformation." Pair with his Essence of Calculus for derivatives.
  • CS231n — Backprop notes: the clearest from-scratch derivation of the chain rule as backprop, with circuit diagrams.
  • Dive into Deep Learning — Mathematics appendix: runnable, ML-first treatment of every concept here.

Next: How models learn puts this gradient into the full training loop — loss surfaces, learning rates, SGD, and overfitting — then neural networks stacks the matmuls into a real model.
