The dot product is similarity, a matrix is a function, a gradient is the arrow pointing downhill, and cross-entropy is the loss you minimize — the four ideas that make every model in this atlas legible instead of magic.
ML runs on four pieces of math, and each is a thing you already use as a programmer: a vector is a list of numbers, a matrix is a pure function that maps one list to another, a gradient is the compiler-computed arrow telling you which way to nudge your parameters to reduce error, and a probability distribution is just normalized scores that sum to 1. You do not need to do this math by hand — autograd and BLAS do it — but you need to read it, the way you read a stack trace: fluently enough that "we take the dot product of the query and key vectors" lands as an obvious sentence, not a wall.
Every downstream pillar quietly assumes this vocabulary:
- Attention is softmax(QKᵀ/√d)·V: three operations from this lesson.

Interviewers won't ask you to prove a theorem. They'll say "why cosine similarity?" or "what does loss.backward() actually compute?" and expect a crisp, correct answer in 30 seconds. This lesson gives you exactly those answers and nothing you won't use.
Three buckets: linear algebra (how data and models are represented), calculus (how models learn), probability (how we score predictions). We teach the real techniques — just the slice each one actually contributes.
Step by step.
- [2, 1, 3] is a point at coordinates (2, 1, 3).
- [2, 1] · [3, 4] = 2·3 + 1·4 = 10.
- −log(prob of right answer) is your loss; minimize it with gradient descent.
- .backward() computes the gradient via the chain rule; update parameters in the opposite direction.

Remember this: everything in deep learning is matrix multiply → softmax → cross-entropy, repeated and nested, and calculus tells you how to tweak the parameters to make error shrink.
A vector is an ordered list of numbers, written x = [x₁, x₂, …, xₙ]. The subscript is just an index, like x[i]. Geometrically it's a point (or an arrow from the origin) in n-dimensional space. In ML, everything becomes a vector: an embedding is a vector of (say) 1024 numbers that represents the meaning of a word or document, learned so that similar meanings land at nearby points.
The dot product is the one operation to internalize. For two vectors a and b of the same length:

a · b = Σᵢ aᵢ·bᵢ = a₁b₁ + a₂b₂ + … + aₙbₙ
What each symbol means: a and b are two vectors (lists of numbers); aᵢ and bᵢ are their individual elements; the Σ (sum) symbol means add up all the products aᵢ·bᵢ.
A concrete example: let a = [1, 2, 3] and b = [2, 0, 1]. Align them:
- 1 · 2 = 2
- 2 · 0 = 0
- 3 · 1 = 3
- 2 + 0 + 3 = 5

So a · b = 5. If the two vectors pointed the exact same direction and had the same length, you'd get a large positive number; if they were perpendicular (unrelated), you'd get zero. The dot product is literally "how much do they align?"
What just happened: we turned two vectors into a single similarity score. That's the entire trick of vector search and attention — embed the query and the document, dot them, and the bigger the number, the more relevant they are.
The Σ (capital sigma) means "sum over all i" — it's a for loop that adds up the elementwise products. Example: a = [1, 2, 3], b = [2, 0, 1] → a·b = 1·2 + 2·0 + 3·1 = 5.
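The "Σ is a for loop" reading can be taken literally. A minimal sketch in plain Python; the function name `dot` is our own choice for illustration, not something the lesson defines:

```python
def dot(a, b):
    """Dot product: multiply elementwise, then add everything up."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for a_i, b_i in zip(a, b):   # the sum over i is this loop
        total += a_i * b_i
    return total


print(dot([1, 2, 3], [2, 0, 1]))   # 5.0, the worked example above
print(dot([1, 0], [0, 1]))         # 0.0, perpendicular vectors
```

In practice you would call a library routine (for example NumPy's `@` operator) rather than loop in Python; the loop is only here to show what the symbol means.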
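Why it matters: the dot product measures alignment. It equals |a| |b| cos θ, where |a| is the length (or norm) of a, its magnitude |a| = √(Σ aᵢ²), and θ is the angle between the two vectors. So:

- Same direction (θ = 0°): the dot product is large and positive.
- Perpendicular (θ = 90°): the dot product is zero.
- Opposite directions (θ = 180°): the dot product is negative.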
Strip out the lengths and you get cosine similarity = a·b / (|a||b|), a pure "how aligned are these directions" score in [-1, 1]. That single formula is how vector search finds the chunk most relevant to your query: embed both, take the cosine, sort. The same operation is the heart of attention: "how much should token A attend to token B" is a dot product of their query/key vectors.
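"Embed both, take the cosine, sort" might look like the sketch below. The three-dimensional vectors and the document names are made up for illustration; real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Length |a| = sqrt(sum of squares)."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """a.b / (|a||b|): alignment of directions, in [-1, 1]."""
    return dot(a, b) / (norm(a) * norm(b))


# Hypothetical "embeddings", invented for this example.
query = [1.0, 2.0, 0.0]
docs = {
    "same_direction": [2.0, 4.0, 0.0],    # the query scaled by 2
    "perpendicular": [-2.0, 1.0, 0.0],
    "opposite": [-1.0, -2.0, 0.0],
}

# Score every document against the query, then sort best-first.
scores = {name: cosine_similarity(query, vec) for name, vec in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(name, round(scores[name], 3))
```

Note that `same_direction` scores about 1 even though it is twice as long as the query: cosine similarity ignores length and compares direction only.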
A matrix is a 2-D grid of numbers, W with entry Wᵢⱼ at row i, column j. The right mental model isn't "spreadsheet"; it's a function. A matrix is a linear transformation: feed it a vector, get back a vector, via matrix–vector multiply:

y = Wx, where yᵢ = Σⱼ Wᵢⱼ·xⱼ
What each symbol means: W is a matrix (rows and columns of learned numbers); x is an input vector; y is the output vector; Wᵢⱼ is the element in row i, column j; yᵢ = Σⱼ Wᵢⱼ·xⱼ says "output element i is the dot product of row i of W with the whole input x."
A concrete example: let x = [1, 2] and W = [[1, 0], [2, 3], [0, 1]] (3 rows, 2 columns — 3 classes or outputs).
- Row 1: [1, 0] · [1, 2] = 1·1 + 0·2 = 1
- Row 2: [2, 3] · [1, 2] = 2·1 + 3·2 = 8
- Row 3: [0, 1] · [1, 2] = 0·1 + 1·2 = 2

So y = [1, 8, 2]. Each row of W is a learned "detector": it looks for a specific pattern in the input.
What just happened: one neural network layer. The matrix W is the learned weights; each output is asking "how much does this input match this learned pattern?" Stack many of these, with a nonlinearity after each, and you have a deep network.
Read that as: output element i is the dot product of row i of W with the input x. A neural network layer is exactly this — y = Wx + b (the +b shifts the result, like an intercept) — so a layer is "apply this learned linear function, then a nonlinearity." The numbers in W are the learned parameters; training is the search for good ones.
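As a sketch of y = Wx + b in plain Python, reusing the W and x from the worked example. The bias values are hypothetical (the text does not give any), and the helper names are ours:

```python
def matvec(W, x):
    """y_i = sum_j W[i][j] * x[j]: one dot product per row of W."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def linear_layer(W, x, b):
    """y = Wx + b, before any nonlinearity is applied."""
    return [y_i + b_i for y_i, b_i in zip(matvec(W, x), b)]


W = [[1, 0],
     [2, 3],
     [0, 1]]           # 3 outputs (rows) x 2 inputs (columns)
x = [1, 2]
b = [0.5, -1.0, 0.0]   # hypothetical bias, not from the text

print(matvec(W, x))            # [1, 8, 2]
print(linear_layer(W, x, b))   # [1.5, 7.0, 2.0]
```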
Matrix–matrix multiply stacks this: (AB)ᵢⱼ = Σₖ Aᵢₖ Bₖⱼ — entry (i,j) is the dot product of row i of A with column j of B. This is the compute-bottleneck operation of all of deep learning; it's what GPUs are built to do fast and what "FLOPs" mostly counts. When someone says a model is "a stack of matmuls," they mean it almost literally.
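Written out naively, the formula is three nested loops. This is a teaching sketch only (real libraries use heavily optimized kernels), and the function name `matmul` is our own:

```python
def matmul(A, B):
    """(AB)[i][j] = sum_k A[i][k] * B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    if any(len(row) != m for row in A):
        raise ValueError("columns of A must match rows of B")
    C = [[0] * p for _ in range(n)]
    for i in range(n):           # each row of A
        for j in range(p):       # each column of B
            for k in range(m):   # the dot product of that row and column
                C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
print(matmul(A, B))   # [[19, 22], [43, 50]]
```

The loops perform n·m·p multiply-adds, which is why the cost of a model is usually dominated by the sizes of its matrices.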
Training = "tweak the parameters to make the error smaller." Calculus tells you which way to tweak.
A derivative is a slope. For a one-input function f(x), the derivative f'(x) (also written df/dx) answers: if I nudge x up a tiny bit, how fast does f change, and in which direction? Positive slope → f rises as x rises; negative → f falls. Formally it's the limit of rise-over-run as the step shrinks to zero, but the operational meaning is just local sensitivity: Δf ≈ f'(x)·Δx.
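One way to make "local sensitivity" concrete is to estimate the slope numerically. The sketch below uses a central difference; the step size `h` and the test function f(x) = x² are our choices for illustration:

```python
def numerical_derivative(f, x, h=1e-6):
    """Central difference: (f(x+h) - f(x-h)) / (2h) approximates f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def f(x):
    return x ** 2


slope = numerical_derivative(f, 3.0)   # the exact derivative is 2x = 6
dx = 0.01
predicted_change = slope * dx          # delta f ~ f'(x) * delta x
actual_change = f(3.0 + dx) - f(3.0)

print(round(slope, 4), round(predicted_change, 4), round(actual_change, 4))
```

The predicted and actual changes agree to within about 0.0001 here; the approximation gets better as the nudge `dx` gets smaller.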
Models have millions of inputs (parameters), not one. The gradient generalizes the derivative to many variables. For f(x₁,…,xₙ), the gradient is the vector of partial derivatives, where each ∂f/∂xᵢ is "the slope in the xᵢ direction, holding the others fixed" (the curly ∂ just signals 'partial'):

∇f = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)
∇ ("nabla") is the gradient operator. The key fact: the gradient points in the direction of steepest increase of f. So to decrease f (your error), step in the opposite direction. That's gradient descent:

x ← x − η·∇f(x)
What each symbol means: x is a parameter (or a vector of parameters); ∇f is the gradient (the direction of steepest increase); η (eta) is the learning rate (a small positive number like 0.01); ← means "replace with."
A concrete example: suppose f(x) = x² (a simple bowl-shaped loss) and you start at x = 3.
- Gradient at x = 3: ∇f = df/dx = 2x = 6 (the slope is 6; the function rises as x increases).
- Learning rate: η = 0.1.
- Update: x ← 3 − 0.1·6 = 3 − 0.6 = 2.4.
- Check: f(3) = 9, f(2.4) = 5.76. Loss fell.
- Next step: at x = 2.4, ∇f = 4.8, so x ← 2.4 − 0.1·4.8 = 1.92.

Repeated, you slide toward x = 0 (the minimum).
What just happened: the gradient is the compiler/autograd telling you "if you nudge x up, f rises this fast." So you nudge the opposite direction, and the error shrinks. Do this for billions of weights, and you train a model.
η ("eta") is the learning rate — a small positive step size you choose (e.g. 0.01). The arrow ← means "update in place." Repeat until the error stops dropping. That one line is how every model in this atlas is trained.
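The update rule can be run as a loop. This sketch repeats the f(x) = x² example; the helper name `descend`, the step counts, and the second, deliberately oversized learning rate are our own additions to show what "too large" looks like:

```python
def descend(x0, eta, steps):
    """Repeat x <- x - eta * f'(x) for f(x) = x**2, whose derivative is 2x."""
    x = x0
    path = [x]
    for _ in range(steps):
        x = x - eta * (2 * x)
        path.append(x)
    return path


path = descend(3.0, eta=0.1, steps=50)
print([round(v, 4) for v in path[:3]])   # [3.0, 2.4, 1.92]
print(abs(path[-1]) < 1e-3)              # True: close to the minimum at 0

# With eta = 1.1 every step overshoots the minimum and lands farther away.
diverged = descend(3.0, eta=1.1, steps=10)
print(abs(diverged[-1]) > abs(diverged[0]))   # True
```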
Worked micro-example. Let f(x,y) = x² + y² (a bowl; minimum at the origin). Its gradient is ∇f = (2x, 2y). At the point (3, 4), ∇f = (6, 8). Step downhill with η = 0.1: new point = (3, 4) − 0.1·(6, 8) = (2.4, 3.2). Check: f went from 9+16 = 25 down to 5.76+10.24 = 16. One step, error fell. Iterate and you slide to the bottom.
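The same micro-example can be checked numerically, estimating each partial derivative by nudging one coordinate at a time. This is a sketch for intuition (autograd computes exact gradients far more efficiently); the helper `numerical_gradient` and the step size `h` are our assumptions:

```python
def f(p):
    x, y = p
    return x ** 2 + y ** 2


def numerical_gradient(f, p, h=1e-6):
    """Estimate each partial derivative with a central difference."""
    grad = []
    for i in range(len(p)):
        up = list(p)
        down = list(p)
        up[i] += h       # nudge only coordinate i, hold the others fixed
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


point = [3.0, 4.0]
grad = numerical_gradient(f, point)   # close to the analytic (6, 8)
eta = 0.1
new_point = [p_i - eta * g_i for p_i, g_i in zip(point, grad)]

print([round(g, 4) for g in grad])        # approximately [6.0, 8.0]
print([round(v, 4) for v in new_point])   # approximately [2.4, 3.2]
print(round(f(new_point), 4))             # approximately 16.0, down from 25
```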
The chain rule is the engine that makes this work for deep networks. A network is functions nested inside functions: loss(layer3(layer2(layer1(x)))). The chain rule says the derivative of a composition is the product of the local derivatives:

dz/dx = (dz/du)·(du/dx), for z = f(u) and u = g(x)
Concretely: z = (3x+1)². Let u = 3x+1, so z = u². Then dz/dx = (dz/du)(du/dx) = (2u)(3) = 6(3x+1). At x=1: 6·4 = 24.
Backpropagation ("backprop") is just the chain rule applied right-to-left across the whole network: do a forward pass to compute the loss, then walk backward multiplying local derivatives to get ∂loss/∂w for every parameter w at once. It's reverse-mode automatic differentiation. You never write it by hand — loss.backward() does it — but now you know exactly what that call computes: the gradient, via the chain rule.
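To see what the backward walk involves, here is a hand-written version for a deliberately tiny model: one weight, one bias, squared error. The numbers are invented for illustration; an autograd library does the same bookkeeping automatically for every operation in the graph.

```python
# Forward pass: compute the loss and keep every intermediate value.
x, target = 2.0, 7.0   # one made-up training example
w, b = 1.5, 0.5        # made-up parameters

pred = w * x + b       # 3.5
err = pred - target    # -3.5
loss = err ** 2        # 12.25

# Backward pass: walk right-to-left, multiplying local derivatives.
dloss_dloss = 1.0
dloss_derr = dloss_dloss * 2 * err   # d(err**2)/d(err) = 2*err
dloss_dpred = dloss_derr * 1.0       # d(pred - target)/d(pred) = 1
dloss_dw = dloss_dpred * x           # d(w*x + b)/dw = x
dloss_db = dloss_dpred * 1.0         # d(w*x + b)/db = 1

print(loss, dloss_dw, dloss_db)      # 12.25 -14.0 -7.0

# Sanity check against a finite-difference estimate of d(loss)/dw.
h = 1e-6
numeric_dw = (((w + h) * x + b - target) ** 2
              - ((w - h) * x + b - target) ** 2) / (2 * h)
print(abs(numeric_dw - dloss_dw) < 1e-4)   # True
```

Both gradients are negative, so gradient descent would increase w and b, moving the prediction 3.5 up toward the target 7.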
Models output scores; probability turns those scores into predictions you can read as probabilities and gives us something principled to minimize.
A distribution assigns a probability to each possible outcome, with the rule that the probabilities are non-negative and sum to 1. "70% cat, 20% dog, 10% bird" is a distribution over 3 classes.
Softmax is how a model produces one. The network's raw output scores are called logits: unbounded real numbers, e.g. [1.0, 3.0, 1.0]. Softmax exponentiates and normalizes them into a distribution:

softmax(z)ᵢ = e^{zᵢ} / Σⱼ e^{zⱼ}
What each symbol means: z is a vector of logits (raw, unbounded scores); e is Euler's number (about 2.718), the base of the exponential; softmax(z)ᵢ is the i-th output (a probability for class i); Σⱼ sums over all classes.
A concrete example: logits z = [1.0, 3.0, 1.0] for three classes.
- Exponentiate: e^1.0 ≈ 2.72, e^3.0 ≈ 20.09, e^1.0 ≈ 2.72.
- Sum: 2.72 + 20.09 + 2.72 ≈ 25.53.
- Normalize: 2.72 / 25.53 ≈ 0.107, 20.09 / 25.53 ≈ 0.787, 2.72 / 25.53 ≈ 0.107.

Result: [0.107, 0.787, 0.107]. All entries are non-negative, they sum to 1 (up to rounding), and class 1 (the highest logit) gets the highest probability.
What just happened: we normalized raw scores into a true probability distribution. The higher logit gets amplified (exponential does that), and the denominator ensures they sum to 1. Now you can apply cross-entropy loss.
e^{z} is the exponential (always positive, so no negative "probabilities"); the denominator forces the outputs to sum to 1. For logits [1, 3, 1] you get [0.107, 0.787, 0.107]. Bigger logit → bigger share, and the gap is amplified.
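A plain-Python softmax might look like this. Subtracting the largest logit before exponentiating is a common numerical-stability trick that the text does not mention; it leaves the result unchanged because the shift cancels between numerator and denominator.

```python
import math


def softmax(logits):
    """Exponentiate, then divide by the total so the outputs sum to 1."""
    m = max(logits)                            # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]   # all exponents are now <= 0
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 3.0, 1.0])
print([round(p, 3) for p in probs])   # [0.107, 0.787, 0.107]

# Only the gaps between logits matter, so these give the same shares.
# Without the shift, math.exp(1000.0) would raise OverflowError.
big = softmax([1000.0, 1002.0, 1000.0])
print([round(p, 3) for p in big])     # [0.107, 0.787, 0.107]
```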
Expectation is a probability-weighted average: E[X] = Σ x·p(x). It's the value you'd get on average over many draws; it shows up whenever we average a loss over data or over a model's outputs.
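As a quick sketch of E[X] = Σ x·p(x), using a fair die and a small batch of made-up loss values (neither example is from the lesson):

```python
def expectation(values, probs):
    """E[X] = sum of x * p(x) over all outcomes."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(values, probs))


# A fair six-sided die: every face has probability 1/6.
die = expectation([1, 2, 3, 4, 5, 6], [1 / 6] * 6)
print(round(die, 6))         # 3.5

# Averaging a loss over a batch is an expectation with equal weights.
losses = [0.2, 0.4, 1.8]     # hypothetical per-example losses
mean_loss = expectation(losses, [1 / 3] * 3)
print(round(mean_loss, 6))   # 0.8
```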
Conditional probability is P(A|B), the "probability of A given B is known", defined as P(A,B)/P(B). A language model is one giant conditional: P(next token | all previous tokens). Bayes' rule flips a conditional around:

P(H|D) = P(D|H)·P(H) / P(D)
Read H as a hypothesis (e.g. "this email is spam") and D as observed data (the words). It says: posterior belief = (how well the hypothesis explains the data) × (prior belief), normalized. It's the backbone of probabilistic reasoning and the namesake of the Naive Bayes classifier.
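A numeric sketch of the spam reading, with invented numbers: suppose 20% of email is spam, and the word "free" appears in 60% of spam but only 5% of non-spam. The function and parameter names are ours.

```python
def posterior(prior_h, p_d_given_h, p_d_given_not_h):
    """P(H|D) = P(D|H) P(H) / P(D), with P(D) summed over H and not-H."""
    evidence = p_d_given_h * prior_h + p_d_given_not_h * (1 - prior_h)
    return p_d_given_h * prior_h / evidence


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
p_spam_given_free = posterior(prior_h=0.2,
                              p_d_given_h=0.6,
                              p_d_given_not_h=0.05)
print(round(p_spam_given_free, 3))   # 0.75
```

Seeing the word moved the belief from a 20% prior to a 75% posterior: 0.12 / (0.12 + 0.04).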
Why log-likelihood and cross-entropy? Training a probabilistic model means maximum likelihood estimation (MLE): pick parameters that make the observed data most probable. The probability of the whole dataset (assuming independent examples) is a giant product ∏ᵢ p(xᵢ; θ). Products of thousands of small numbers underflow to zero and are painful to differentiate. So we take the log: log ∏ = Σ log, turning the product into a sum. Log is monotonic, so the maximizer is unchanged — but the math becomes a clean, stable sum of log p. We then flip the sign (optimizers minimize), giving negative log-likelihood, the thing we descend on.
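The underflow is easy to reproduce. In this sketch, 1000 examples each get probability 0.01 from some hypothetical model; the product vanishes in floating point while the sum of logs stays perfectly usable:

```python
import math

# Hypothetical per-example probabilities assigned by a model.
probs = [0.01] * 1000

product = 1.0
for p in probs:
    product *= p   # true value is 1e-2000, far below what a float can hold

log_likelihood = sum(math.log(p) for p in probs)   # a sum: no underflow
nll = -log_likelihood                              # the quantity we minimize

print(product)         # 0.0
print(round(nll, 2))   # 4605.17
```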
For classification this is exactly cross-entropy loss. With a true label that's "100% the correct class" (a one-hot target) and model probabilities q, cross-entropy collapses to:

loss = −log q(correct class)
Worked example. True class = "cat," model says cat 0.7 → loss = −log(0.7) ≈ 0.357. A different run says cat 0.1 → loss = −log(0.1) ≈ 2.303. Confident and right → tiny loss; confident and wrong → big loss. That's why we minimize cross-entropy instead of accuracy: it's smooth and differentiable (so gradient descent has a slope to follow) and it rewards calibrated confidence, whereas accuracy is a flat step function with no gradient to learn from.
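The worked example, plus the contrast with accuracy, as a small sketch. Only the 0.7 and 0.1 values come from the text; the rest of each distribution is made up so that it sums to 1.

```python
import math


def cross_entropy(probs, true_class):
    """One-hot target: the loss is -log of the probability on the right class."""
    return -math.log(probs[true_class])


def is_correct(probs, true_class):
    """Accuracy for one example: is the top-scoring class the true one?"""
    return probs.index(max(probs)) == true_class


cat = 0   # index of the true class
print(round(cross_entropy([0.7, 0.2, 0.1], cat), 3))   # 0.357
print(round(cross_entropy([0.1, 0.6, 0.3], cat), 3))   # 2.303

# Accuracy cannot tell these two predictions apart; the loss can.
barely_right = [0.40, 0.35, 0.25]
clearly_right = [0.90, 0.05, 0.05]
print(is_correct(barely_right, cat), is_correct(clearly_right, cat))   # True True
print(cross_entropy(barely_right, cat) > cross_entropy(clearly_right, cat))   # True
```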
A full classifier forward pass — matmul → softmax → cross-entropy — is just the math above:
import numpy as np
x = np.array([2.0, 1.0, 1.0]) # input feature vector (length 3)
W = np.array([[ 1.0, -1.0, 0.0], # learned weights: 3 classes (rows)
[ 0.0, 2.0, 1.0], # x 3 features (cols)
[ 1.0, 0.0, -1.0]])
logits = W @ x # matrix-vector multiply -> [1., 3., 1.]
# each entry = dot(row_of_W, x)
probs = np.exp(logits) / np.exp(logits).sum() # softmax -> [0.107, 0.787, 0.107]
true_class = 1 # the correct label is class #1
loss = -np.log(probs[true_class]) # cross-entropy = -log(0.787) ≈ 0.24
print(logits, probs, loss)

Line by line: W @ x (the @ is matmul) produces the logits, one dot product per class. np.exp(...)/...sum() is softmax, normalizing logits into a probability distribution. -np.log(probs[true_class]) is cross-entropy: because the model put 0.787 on the right class, the loss is low (≈0.24). Drop that to 0.1 and the loss jumps to ≈2.3, the penalty for confident wrongness.
Section 1 (the x and W definitions): Set up input and weights.
We have an input vector x = [2.0, 1.0, 1.0] (three features, like pixels or extracted info) and a weight matrix W (3 rows for 3 classes, 3 columns for 3 features). Each row is a learned pattern detector.
Section 2 (the logits line): Compute logits via matrix multiply.
W @ x multiplies each row of W by the input:
- 1.0·2 + (−1.0)·1 + 0·1 = 1.0
- 0·2 + 2.0·1 + 1.0·1 = 3.0
- 1.0·2 + 0·1 + (−1.0)·1 = 1.0

Logits = [1.0, 3.0, 1.0]: raw, unbounded scores for the three classes.
Section 3 (the probs line): Apply softmax.
np.exp(logits) / np.exp(logits).sum() exponentiates and normalizes: you get [0.107, 0.787, 0.107]. Class 1 gets probability 0.787; classes 0 and 2 split the rest.
Section 4 (the true_class and loss lines): Compute loss.
The true class is 1. Cross-entropy = −log(0.787) ≈ 0.24, a low loss because we were confident and right. The −log is the penalty: had the model put only 0.1 on the right class, the loss would be −log(0.1) ≈ 2.3, a large penalty for confident wrongness.
What this code does: a forward pass through a 3-way classifier. The model predicted [0.107, 0.787, 0.107]; the loss is 0.24. To train, backpropagate to get the gradient of the loss with respect to W, then take a gradient-descent step to make that loss smaller.
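To close the loop, here is one training step for the same classifier in plain Python. It relies on a standard result the lesson does not derive: for softmax followed by cross-entropy, the gradient of the loss with respect to logit i is probs[i] minus the one-hot target. The learning rate and the helper names are our assumptions.

```python
import math


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(W, x, true_class):
    probs = softmax(matvec(W, x))
    loss = -math.log(probs[true_class])
    # Softmax + cross-entropy: d(loss)/d(logit_i) = probs[i] - onehot[i].
    dlogits = [p - (1.0 if i == true_class else 0.0)
               for i, p in enumerate(probs)]
    # logit_i = sum_j W[i][j] * x[j], so d(loss)/dW[i][j] = dlogits[i] * x[j].
    grad_W = [[d * x_j for x_j in x] for d in dlogits]
    return loss, grad_W


x = [2.0, 1.0, 1.0]
W = [[1.0, -1.0, 0.0],
     [0.0, 2.0, 1.0],
     [1.0, 0.0, -1.0]]
true_class = 1
eta = 0.1   # hypothetical learning rate

loss_before, grad_W = loss_and_grad(W, x, true_class)
# Gradient descent: W <- W - eta * grad_W, entry by entry.
W = [[w - eta * g for w, g in zip(w_row, g_row)]
     for w_row, g_row in zip(W, grad_W)]
loss_after, _ = loss_and_grad(W, x, true_class)

print(round(loss_before, 2))      # 0.24
print(loss_after < loss_before)   # True: one step reduced the loss
```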
And the calculus half — the gradient via the chain rule — is one line with autograd:
import torch
x = torch.tensor(1.0, requires_grad=True) # track gradients for x
z = (3 * x + 1) ** 2 # forward pass builds the graph
z.backward() # backprop: chain rule, fills x.grad
print(x.grad) # tensor(24.) == 6*(3*1+1), as we derived

z.backward() walks the computation graph backward, multiplying local derivatives (dz/du = 2u, du/dx = 3) to land on dz/dx = 24, exactly our hand-derived chain-rule answer. Scale this from one variable to a billion and you have how every model trains.
Section 1 (the x = torch.tensor(...) line): Create a tracked variable.
x = torch.tensor(1.0, requires_grad=True) creates x = 1.0 and tells PyTorch "remember every operation on x so we can differentiate backward."
Section 2 (the z = ... line): Forward pass.
z = (3*x + 1)**2 builds a computation graph: z depends on x through three operations (multiply by 3, add 1, then square). PyTorch tracks this chain.
Section 3 (the z.backward() call): Backward pass (backpropagation).
z.backward() applies the chain rule in reverse:
- Start at z with "derivative of z with respect to itself" = 1.
- dz/du = d(u²)/du = 2u = 2(3·1+1) = 8 (where u = 3x+1).
- du/dx = 3.
- dz/dx = 8 · 3 = 24.

Section 4 (the print call): Print the result.
x.grad is now 24.0 — the derivative we hand-calculated earlier. That's the gradient: "if you nudge x up by a tiny amount, z goes up by about 24 times that amount."
What this code does: automatic differentiation. You write the forward pass, call .backward(), and PyTorch fills in every gradient. Scale this from 1 variable to 175 billion (like GPT-3), and you have how modern deep learning works: the chain rule, automated and run on GPUs.
| ML idea | What it really is, in SWE terms |
|---|---|
| Vector / embedding | A fixed-length array; the model's "hash" of meaning, where distance encodes similarity instead of being random |
| Matrix | A pure function Array → Array; the weights are its (learned) source code |
| Matrix multiply | The hot loop / inner kernel: what profiling would flag, and what GPUs are built to accelerate |
| Gradient | The compiler telling you the direction of steepest increase; you step the other way to optimize |
| Backprop (chain rule) | Reverse-mode autodiff = dynamic programming over the call graph: compute every parameter's blame in one backward sweep |
| Softmax | normalize() for scores — squashes arbitrary reals into probabilities that sum to 1 |
| Cross-entropy loss | The test assertion you minimize: smooth, differentiable, and harsher the more confidently wrong you are |
- Softmax turns logits into values in [0,1] that sum to 1.
- A learning rate η that is too large overshoots and diverges; too small crawls. It's a tuned knob, not "more is better."

- The dot product measures alignment (a·b = |a||b|cosθ): large positive when they point the same way, zero when perpendicular, negative when opposed. Embeddings place similar meanings at nearby directions, so the dot product (or its normalized form, cosine similarity) ranks how relevant a document is to a query, which is exactly what retrieval needs.
- Gradient descent computes the gradient of the loss and steps every parameter against it (w ← w − η∇L) to reduce the loss, repeating until it stops improving.
- Cross-entropy (−log q(correct class)) is smooth and differentiable, so gradient descent has a slope to follow; it also rewards calibrated confidence, penalizing confidently-wrong predictions heavily. Accuracy is a flat step function with zero gradient almost everywhere, so there's nothing for the optimizer to descend.
- What does loss.backward() compute, mechanically? It runs backpropagation: starting from the scalar loss, it walks the computation graph in reverse, applying the chain rule to multiply local derivatives and accumulate ∂loss/∂w for every parameter in a single pass. This is reverse-mode automatic differentiation.

You're ready to move on when you can read "softmax(QKᵀ/√d)·V" and "minimize the negative log-likelihood via gradient descent" and narrate, in plain English, what each piece does.
Next: How models learn puts this gradient into the full training loop — loss surfaces, learning rates, SGD, and overfitting — then neural networks stacks the matmuls into a real model.