A neural network is just a stack of matrix multiplies glued together by nonlinear "switches" — once you see it as composed functions with learnable constants, "deep learning", embeddings, and the road to transformers stop being magic.
A neural network is a long pipeline of matmul → add bias → squash steps, where the matrices are the learnable constants and "training" is just tuning those constants so the pipeline maps inputs to the outputs you want. If you've ever written a data-transformation pipeline — parse, normalize, project, classify — you already have the shape; the twist is that instead of you writing each stage's logic, gradient descent fills in the numbers. The single non-obvious ingredient is a nonlinearity between stages: without it, the whole stack algebraically collapses into one matrix and can only draw straight lines. With it, stacking stages lets the network learn its own intermediate representations — edges, then shapes, then objects — which is the entire payoff of the word "deep".
Everything in the later pillars is a neural network in a trench coat. A transformer (the engine behind every LLM you'll touch in /agents and /rag) is a specific arrangement of the exact pieces in this lesson: matrix multiplies, nonlinear activations, and a softmax at the end. When /finetuning talks about "updating the weights" or LoRA "adapting a subset of parameters", these are the weights and parameters defined here. When /rag/chunking-and-embeddings says retrieval "searches over embeddings", those embeddings are dense vectors produced by a neural network — and this lesson is where the word stops being a black box.
Interviews silently assume you can: explain why a deep net isn't just one big linear regression; say what a "forward pass" computes; distinguish a parameter (learned) from a hyperparameter (chosen); and not confuse logits with probabilities. Get these wrong and you signal "I've used the API but never looked inside." This lesson is the inside.
The words first.
- […] bias number. This is the "mix" step.
- […] ReLU, which just turns negatives into 0). Without it, stacking neurons would only ever produce straight-line math, so this is what lets networks learn bends and curves.

Step by step.

- […] embedding.
- […] weights and sums them, adding the bias.
- […] activation function to get the neuron's activation.

Remember this: a neural network is just layers of neurons doing "weighted sum then bend," passing numbers forward until the last layer is the answer. That journey is the forward pass.
The atom is the neuron (the perceptron, in its original 1958 form). It takes an input vector x = [x₁, x₂, …, xₙ], multiplies each component by a weight wᵢ, sums them, adds a bias b, then passes the result through an activation function φ:
z = w·x + b = (w₁x₁ + w₂x₂ + … + wₙxₙ) + b

Every symbol in plain words: w is the weight vector (how much each input matters), x is the input vector (the numbers coming in), b is a single bias value (a threshold shifter), and z is the weighted sum (the "pre-activation" raw score).
Here's a concrete walk-through. Suppose you have two inputs: x = [2.0, 3.0]. The weights are w = [0.5, 0.2] and the bias is b = 1.0. Then:
- Multiply each input by its weight: 0.5 * 2.0 = 1.0, 0.2 * 3.0 = 0.6.
- Sum the products: 1.0 + 0.6 = 1.6.
- Add the bias: 1.6 + 1.0 = 2.6.

So z = 2.6, and that's the raw score. Next, you squash it through an activation function (like ReLU) to get the neuron's actual output a. This equation is the entire neuron: a dot product (how aligned your input is with your weights) plus a shift (the bias), producing one number.
a = φ(z)
Symbol check: `x` is the input (given). `w` and `b` are **parameters** — the numbers learning will tune. `z` is the **pre-activation** (a raw score, also loosely called a logit at the final layer). `φ` is a fixed nonlinear function you choose. `a` is the neuron's **activation** — its output for *this* input.
Concrete micro-example. Let `x = [2.0, -1.0]`, `w = [0.5, -1.5]`, `b = 1.0`, and `φ = ReLU` (defined below):
z = 0.5·2.0 + (-1.5)·(-1.0) + 1.0 = 1.0 + 1.5 + 1.0 = 3.5

a = ReLU(3.5) = 3.5
That's the whole neuron: a dot product, a shift, a squash. The dot product `w·x` measures "how much does this input point in the direction my weights care about", the bias shifts the threshold, and `φ` decides whether and how strongly the neuron "fires".
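The micro-example above can be checked in a few lines of NumPy. This is a minimal sketch rather than a canonical implementation; the helper name `neuron` is hypothetical and chosen only for illustration.

```python
import numpy as np

def relu(z):
    # ReLU: keep positives, clamp negatives to 0
    return np.maximum(0.0, z)

def neuron(x, w, b, phi=relu):
    # Hypothetical helper: weighted sum, bias shift, then activation
    z = np.dot(w, x) + b   # pre-activation (raw score)
    a = phi(z)             # activation (the neuron's output)
    return z, a

x = np.array([2.0, -1.0])
w = np.array([0.5, -1.5])
b = 1.0

z, a = neuron(x, w, b)
print(z, a)  # 3.5 3.5
```

Swapping in `x = [2.0, 3.0]`, `w = [0.5, 0.2]` reproduces the earlier walk-through value `z = 2.6`.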
### 3.2 Stacking into layers — the MLP — and why nonlinearity is non-negotiable
Put many neurons side by side, all reading the same input, and you get a **layer**. Stack `m` neurons reading an `n`-dim input and their weights become a single matrix `W` of shape `m×n`; the whole layer is one matrix multiply plus a bias vector plus an elementwise activation:
a = φ(W x + b) # W: m×n, x: n, b: m, a: m
Chain layers — feed one layer's output into the next — and you have a **multilayer perceptron (MLP)**, also called a fully-connected or feedforward network. With layers `1…L`:
a⁽¹⁾ = φ(W⁽¹⁾ x + b⁽¹⁾)

a⁽²⁾ = φ(W⁽²⁾ a⁽¹⁾ + b⁽²⁾)

…

ŷ = W⁽ᴸ⁾ a⁽ᴸ⁻¹⁾ + b⁽ᴸ⁾ # final layer often has no φ (or a softmax)
Running input forward through this chain to get `ŷ` is the **forward pass**. Notice what it actually is: a sequence of matrix multiplies (GEMMs, in systems terms) separated by cheap elementwise functions. The layers between input and output are **hidden layers**; their size (number of neurons) is the **width**, and the number of layers is the **depth**.
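As a rough sketch of that chain, the forward pass is a loop over `(W, b)` pairs. The function name `mlp_forward`, the layer sizes, the random weights, and the choice of ReLU for every hidden layer are all assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, layers):
    # layers: list of (W, b) pairs; ReLU after every layer except the last
    a = x
    for W, b in layers[:-1]:
        a = relu(W @ a + b)          # hidden layer: matmul + bias + nonlinearity
    W_last, b_last = layers[-1]
    return W_last @ a + b_last       # output layer: raw scores, no activation

rng = np.random.default_rng(0)       # arbitrary seed
sizes = [4, 8, 8, 3]                 # input dim 4, two hidden layers of width 8, 3 outputs
layers = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=4)
y_hat = mlp_forward(x, layers)
print([W.shape for W, _ in layers])  # [(8, 4), (8, 8), (3, 8)]
print(y_hat.shape)                   # (3,)
```

Here depth is the length of `layers` and width is the row count of each `W`, so the shapes printed are exactly the `m×n` bookkeeping from the layer equation.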
Now the crucial part. **Why must `φ` be nonlinear?** Suppose you dropped it (φ = identity). Then two stacked layers compute:
W⁽²⁾(W⁽¹⁾x + b⁽¹⁾) + b⁽²⁾ = (W⁽²⁾W⁽¹⁾)x + (W⁽²⁾b⁽¹⁾ + b⁽²⁾) = W' x + b'
Walk through what happens when you remove the nonlinearity and chain two matrix multiplies. Suppose x = 1.0 (scalar for clarity), and layer 1 has W⁽¹⁾ = 2.0, b⁽¹⁾ = 0.5, and layer 2 has W⁽²⁾ = 3.0, b⁽²⁾ = 0.1.
If there's no activation between them (no φ):
- Layer 1 output: W⁽¹⁾ * x + b⁽¹⁾ = 2.0 * 1.0 + 0.5 = 2.5.
- Layer 2's input is that 2.5, so layer 2 output: W⁽²⁾ * 2.5 + b⁽²⁾ = 3.0 * 2.5 + 0.1 = 7.6.

But look what you can combine: W⁽²⁾ * (W⁽¹⁾ * x + b⁽¹⁾) + b⁽²⁾ = (W⁽²⁾ * W⁽¹⁾) * x + (W⁽²⁾ * b⁽¹⁾ + b⁽²⁾) = (3.0 * 2.0) * x + (3.0 * 0.5 + 0.1) = 6.0 * x + 1.6. At x = 1.0 that gives the same 7.6. That's just one weight (6.0) and one bias (1.6): two layers became one. A hundred linear layers would do the same. The nonlinearity φ between them is what breaks this algebraic collapse and actually lets depth buy you expressive power.
The product of two matrices is just another matrix. A hundred linear layers collapse into a *single* linear layer `W'x + b'`, no more expressive than plain [linear regression](/ml-foundations/what-is-ml) and able to carve the input space only with straight hyperplanes. The nonlinearity between layers is exactly what stops this collapse and lets depth buy you genuinely more expressive functions. (The formal version is the **universal approximation theorem**: an MLP with one hidden layer and a suitable nonlinearity can approximate any continuous function on a bounded input region to any desired accuracy, given enough width.)
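The collapse is easy to confirm numerically. The sketch below uses arbitrary random matrices (the shapes and seed are assumptions) to compare two stacked linear layers against the merged `W'x + b'`, then reuses the scalar weights from the walk-through at `x = -1.0` to show that a ReLU in between changes the result.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, any values would do
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

# Two linear layers applied in sequence (phi = identity)
two_layers = W2 @ (W1 @ x + b1) + b2

# The single merged layer W'x + b'
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged
print(np.allclose(two_layers, one_layer))   # True

# Scalar weights from the walk-through, evaluated at a negative input
w1, c1, w2, c2 = 2.0, 0.5, 3.0, 0.1
x_neg = -1.0
linear_out = w2 * (w1 * x_neg + c1) + c2          # same as 6.0 * x_neg + 1.6
relu_out = w2 * max(0.0, w1 * x_neg + c1) + c2    # ReLU zeroes the negative hidden value
print(linear_out, relu_out)
```

With the ReLU in place, the hidden value `-1.5` is clamped to `0`, so the output is no longer the merged line `6.0 * x + 1.6` at that input.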
### 3.3 The activation zoo
Different `φ`s for different jobs. You only need to recognize four or five.
| Activation | Formula | Output range | Where it's used |
|---|---|---|---|
| **ReLU** | `max(0, z)` | `[0, ∞)` | default for hidden layers (CNNs, MLPs); cheap, trains well |
| **GELU** | `z · Φ(z)` (Φ = Gaussian CDF) | `≈(-0.17, ∞)` | hidden layers in **transformers**; a smooth ReLU |
| **Sigmoid** | `1 / (1 + e⁻ᶻ)` | `(0, 1)` | a single binary probability; "gates" in LSTMs |
| **Tanh** | `(eᶻ−e⁻ᶻ)/(eᶻ+e⁻ᶻ)` | `(−1, 1)` | older RNN hidden states |
| **Softmax** | `e^{zᵢ} / Σⱼ e^{zⱼ}` | `(0,1)`, sums to 1 | the **output** layer for multi-class problems |
ReLU dominates hidden layers because it's trivially cheap and doesn't "saturate" for positive inputs — sigmoid/tanh flatten out at the extremes, so their gradient (the [slope that training follows](/ml-foundations/how-models-learn)) shrinks toward zero and learning stalls; this is the **vanishing gradient** problem. **Softmax** is special: it takes a *vector* of raw scores and turns it into a probability distribution — every element in `(0,1)`, all summing to 1. An LLM predicting the next token runs a softmax over ~100k vocabulary scores to get "probability of each possible next token". The raw scores feeding into that softmax (or sigmoid) are the **logits** — unnormalized, can be any real number, not yet probabilities.
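The table's formulas translate almost directly into NumPy. This is an illustrative sketch: `gelu` uses the exact Gaussian-CDF form via `math.erf` (libraries often use faster approximations), and `sigmoid_grad` is a hypothetical helper added to make the saturation argument visible.

```python
import math
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gelu(z):
    # z * Phi(z), where Phi is the standard normal CDF
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return z * cdf

def sigmoid_grad(z):
    # derivative of sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(relu(z))
print(np.tanh(z))
print(gelu(z))
print(sigmoid_grad(z))   # largest at z = 0 (0.25), close to 0 at |z| = 10
```

The last line is the vanishing-gradient story in miniature: at the extremes the sigmoid's slope is tiny, so very little learning signal passes through.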
### 3.4 What "deep" actually buys you: representation learning
Why stack *many* layers instead of one very wide one? Because depth lets the network build **hierarchical representations** — each layer learns features composed from the previous layer's features. In a vision network the pattern is famous and literally visualizable:
- **Early layers** detect edges and color gradients.
- **Middle layers** combine edges into textures, corners, simple shapes.
- **Late layers** combine those into object parts and whole concepts (a wheel, a face, a cat).
Nobody programmed "edge detector" — gradient descent discovered that edges are a useful intermediate representation for the final task. This is the headline idea of deep learning, often called **representation learning**: instead of you hand-engineering features (the pre-2012 ML workflow), the network *learns* the features. For text, the same hierarchy shows up as lower layers capturing local syntax and higher layers capturing meaning and reference. This is why "deep" and not just "wide": a wide-shallow net can memorize but struggles to *compose*, while depth lets later layers reuse and recombine what earlier ones found.
### 3.5 Embeddings: from one-hot to learned dense vectors
Networks eat numbers, but much of the world is discrete symbols — words, user IDs, product SKUs. The naive encoding is **one-hot**: a vocabulary of 50,000 words becomes a 50,000-long vector that is all zeros except a single 1. This is huge, wasteful, and — worse — *geometrically meaningless*: "cat" and "dog" are exactly as far apart as "cat" and "tractor" (every pair is orthogonal).
An **embedding** fixes this. It's a learned lookup table: an `embedding matrix E` of shape `vocab × d` (say `50000 × 256`). Token `i`'s embedding is just **row `i` of `E`** — a **dense** `d`-dimensional vector of real numbers. (Multiplying a one-hot vector by `E` selects exactly that row, which is why an embedding layer is literally a row lookup.) These rows are parameters, tuned by training, so the network arranges them in space such that *similar things land near each other*: after training, "cat" and "dog" end up close, and famously `vec("king") − vec("man") + vec("woman") ≈ vec("queen")`.
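The "one-hot times `E` is a row lookup" claim can be verified directly. In this sketch the three-word vocabulary, the dimension `d = 4`, and the random values in `E` are placeholders; in a real model `E` is learned.

```python
import numpy as np

vocab = ["cat", "dog", "tractor"]        # toy vocabulary (illustrative)
d = 4
rng = np.random.default_rng(0)           # arbitrary seed
E = rng.normal(size=(len(vocab), d))     # embedding matrix, shape vocab x d

i = vocab.index("dog")
one_hot = np.zeros(len(vocab))
one_hot[i] = 1.0

via_matmul = one_hot @ E                 # one-hot vector times E
via_lookup = E[i]                        # plain row lookup
print(np.allclose(via_matmul, via_lookup))   # True

# Distinct one-hot vectors are orthogonal: their dot product is 0
cat_hot = np.eye(len(vocab))[vocab.index("cat")]
print(cat_hot @ one_hot)                 # 0.0
```

This is why frameworks implement embedding layers as an index operation instead of a matrix multiply: the result is the same and the lookup skips all the multiplications by zero.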
This is the direct bridge to retrieval. When [/rag/chunking-and-embeddings](/rag/chunking-and-embeddings) says a vector database "searches over embeddings", it means: encode every document chunk into a dense vector, encode the query the same way, and return the chunks whose vectors are closest (by cosine similarity / dot product). The whole premise of semantic search is that a neural net learned to put related meanings near each other in vector space. An embedding is a *one-hot turned into geography*.
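A toy version of that nearest-neighbour search is sketched below. The three-dimensional vectors are hand-made for illustration and do not come from any real embedding model; `cosine_sim` is a hypothetical helper name.

```python
import numpy as np

def cosine_sim(query, matrix):
    # cosine similarity between one query vector and each row of matrix
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

chunks = ["cats purr", "dogs bark", "tractors plough"]
# Hand-made "embeddings": the first axis loosely stands for "animal-ness"
chunk_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.2, 0.9],
])
query_vec = np.array([1.0, 0.0, 0.0])    # pretend embedding of a pet-related query

scores = cosine_sim(query_vec, chunk_vecs)
ranked = np.argsort(-scores)             # indices from most to least similar
print([chunks[i] for i in ranked])       # ['cats purr', 'dogs bark', 'tractors plough']
```

A production vector database replaces the brute-force `m @ q` with an approximate index, but the ranking criterion is the same.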
### 3.6 Two famous architectures, and why we needed attention
The MLP treats inputs as a flat bag of numbers. Two specialized designs exploit *structure*, and their limits set up the jump to transformers.
**CNNs (Convolutional Neural Networks): for grids like images.** Instead of every neuron reading every pixel (which would mean billions of weights for a photo), a CNN slides a small **filter**, a tiny weight matrix of say 3×3, across the whole image, computing the same dot product at every position. This is **weight sharing**: the same parameters are reused everywhere, which (a) slashes parameter count and (b) bakes in **translation equivariance**: shift the cat and the detector's response shifts with it, so the same detector works wherever the cat is. (Pooling layers on top are what push this toward true translation *invariance*.) Stack convolutional layers and you get exactly the edges→parts→objects hierarchy from §3.4. CNNs (AlexNet, 2012) kicked off the deep-learning era in vision.
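The sliding-filter idea can be sketched with two loops. The function name `conv2d_valid`, the hand-picked vertical-edge kernel, and the tiny synthetic images are assumptions for illustration; like most deep-learning libraries, the sketch computes a cross-correlation (no kernel flip).

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide one small kernel over every position (no padding, stride 1)
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # same 9 weights reused at every (r, c): weight sharing
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])     # responds to dark-to-bright vertical edges

image = np.zeros((6, 8))
image[:, 3:] = 1.0                        # edge between columns 2 and 3

shifted = np.zeros((6, 8))
shifted[:, 5:] = 1.0                      # same edge, moved 2 columns right

out_a = conv2d_valid(image, kernel)
out_b = conv2d_valid(shifted, kernel)
print(out_a[0])   # [0. 3. 3. 0. 0. 0.]
print(out_b[0])   # [0. 0. 0. 3. 3. 0.]
```

The filter has 9 parameters no matter how large the image is, and moving the edge by two columns moves the response by two columns.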
**RNNs (Recurrent Neural Networks) — for sequences like text.** An RNN reads tokens one at a time, maintaining a **hidden state** that carries a summary of everything seen so far:
hₜ = φ(Wₓ xₜ + Wₕ hₜ₋₁ + b) # new state depends on current input and previous state
Elegant, but it struggled badly on long sequences for two reasons. **First, it's inherently sequential** — `hₜ` needs `hₜ₋₁` — so you can't parallelize across the sequence, which is brutal on modern GPUs. **Second, vanishing/exploding gradients**: information from far-back tokens has to survive being multiplied through many steps, and the signal decays (or blows up), so the network effectively forgets long-range context. LSTMs and GRUs (RNNs with explicit "gates", using sigmoids to decide what to keep/forget) mitigated this but never fully solved it.
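The recurrence and its sequential bottleneck are visible in a short loop. This is a sketch under assumptions: `rnn_forward` is a hypothetical name, `tanh` stands in for φ, and the dimensions and random weights are arbitrary.

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b, h0):
    # xs: array of shape (seq_len, input_dim); returns every hidden state
    h = h0
    states = []
    for x_t in xs:                             # step t needs h from step t-1
        h = np.tanh(Wx @ x_t + Wh @ h + b)     # h_t = phi(Wx x_t + Wh h_{t-1} + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)                 # arbitrary seed
input_dim, hidden_dim, seq_len = 3, 5, 7
Wx = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
Wh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
xs = rng.normal(size=(seq_len, input_dim))

states = rnn_forward(xs, Wx, Wh, b, h0=np.zeros(hidden_dim))
print(states.shape)   # (7, 5)
```

The `for` loop cannot be turned into one big matrix multiply over the sequence, because each iteration consumes the previous iteration's `h`. The same `Wh` is also applied at every step, which is where the repeated multiplication behind vanishing and exploding gradients comes from.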
The fix was **attention**: let every position look *directly* at every other position in one parallel step, with no information bottleneck through a single hidden state. That idea, scaled up, is the **transformer** — covered in [/transformers](/transformers), and the foundation of every modern LLM. RNNs are why that lesson exists.
### 3.7 The vocabulary interviewers assume
| Term | What it is | When it changes |
|---|---|---|
| **Parameters** | the **weights `W`** and **biases `b`** the model learns | every training step |
| **Hyperparameters** | knobs *you* set before/around training: learning rate, depth, width, activation choice, batch size | you tune them by hand / search |
| **Weights** | the parameter values themselves | frozen during inference |
| **Activations** | a layer's outputs for one specific input | every input (ephemeral) |
| **Logits** | the final layer's raw scores, before softmax/sigmoid | every input |
One-line mnemonic: **parameters are learned, hyperparameters are chosen; weights are fixed at inference, activations and logits are recomputed per input.**
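To make the parameter and hyperparameter split concrete, here is a small sketch that counts learned parameters from a list of layer widths. The function name `count_parameters` and the example widths are illustrative assumptions.

```python
def count_parameters(sizes):
    # A layer mapping n inputs to m outputs has an m x n weight matrix plus m biases
    return sum(m * n + m for n, m in zip(sizes[:-1], sizes[1:]))

# Hyperparameters (chosen by you): depth and widths
sizes = [784, 128, 64, 10]

# Parameters (learned by training): every weight and bias those choices imply
print(count_parameters(sizes))      # 109386

# A network with 2 inputs, 2 hidden units, and 1 output
print(count_parameters([2, 2, 1]))  # 9
```

Changing `sizes` is a hyperparameter decision; the numbers that fill the resulting matrices are what training adjusts.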
Inputs flow left to right through weighted connections and nonlinear neurons; the activation wave is one forward pass. Training nudges the weights so the output matches the target.
A two-layer MLP that computes XOR, the canonical task a linear model provably cannot do, which is precisely why it needs a hidden layer with a nonlinearity. Weights here are hand-picked so we can read them; normally training finds them.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)  # the nonlinearity: kills negatives, keeps positives

    # Layer 1: 2 inputs -> 2 hidden units. W1 is shape (2, 2), b1 is shape (2,)
    W1 = np.array([[1.0, 1.0],     # hidden unit 1 fires on "at least one input on"
                   [1.0, 1.0]])
    b1 = np.array([0.0, -1.0])     # unit 2's bias makes it fire only when BOTH are on

    # Layer 2: 2 hidden -> 1 output. W2 is shape (1, 2)
    W2 = np.array([[1.0, -2.0]])   # output = (input on?) minus 2*(both on?)
    b2 = np.array([0.0])

    def forward(x):
        h = relu(W1 @ x + b1)      # hidden activations: matmul + bias + nonlinearity
        logit = W2 @ h + b2        # output layer: raw score, no activation
        return h, logit

    for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
        h, logit = forward(np.array(x, dtype=float))
        print(x, "hidden:", h, "-> xor:", round(float(logit[0])))

Line by line: `relu` is our φ. `W1 @ x + b1` is the §3.2 layer equation (`@` is matrix-multiply), and wrapping it in `relu` is what makes the network nonlinear. The `forward` function is the forward pass: two matmuls separated by one activation. Running it prints 0, 1, 1, 0, which is correct XOR. Delete the `relu` (make it linear) and no choice of `W1`, `W2` can ever reproduce XOR; that's §3.2's collapse, made concrete.
Section 1: Setup. Define `relu` (the nonlinearity), then hand-craft weights `W1` and `b1` for the hidden layer. `W1 @ x` does the matrix-vector multiply; `+ b1` shifts the result. Similarly, `W2` and `b2` are the output layer's weights and bias. These weights are carefully chosen (not learned) so we can see the arithmetic clearly.

Section 2: The forward pass. The function `forward(x)` runs two stages:

- `h = relu(W1 @ x + b1)`: compute the hidden layer by doing a matmul, adding the bias, then applying `relu` (which zeros out negatives). This is exactly the neuron equation from §3.1, applied to both hidden neurons at once.
- `logit = W2 @ h + b2`: feed the hidden layer's output into the output layer (another matmul and bias), with no activation at the end, so `logit` is just a raw score.

Section 3: Run and verify. For each of the four possible inputs to XOR ([0,0], [0,1], [1,0], [1,1]), compute the forward pass and print the result. Because `relu` is there to provide nonlinearity, the network correctly outputs [0, 1, 1, 0]. But remove `relu` (make it the identity function) and no choice of `W1`, `W2` can solve XOR; that's the algebraic collapse from §3.2 in action.
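One way to see that no linear model can fit XOR is to ask for the best possible one. The sketch below fits `w·x + b` by least squares with `np.linalg.lstsq`; least squares is an illustrative stand-in here, not how the network in this lesson is trained.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])            # XOR targets

# Append a column of ones so the last coefficient acts as the bias b
A = np.hstack([X, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # best linear fit in the least-squares sense

pred = A @ coef
print(np.round(pred, 3))   # [0.5 0.5 0.5 0.5]
```

The best a linear model can do is predict 0.5 for every input, which is no better than not looking at the inputs at all.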
To see logits become probabilities, here is softmax on raw scores:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())  # subtract max for numerical stability
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.1])  # raw output-layer scores for 3 classes
    print(softmax(logits))  # ≈ [0.659, 0.242, 0.099], sums to 1.0

Subtract the max for stability. Exponentials grow very fast. If z contains large numbers (say, 1000), e^1000 overflows. The trick `z - z.max()` shifts everything down so the largest element becomes 0 (and e^0 = 1). This doesn't change the final probabilities because softmax is invariant to adding the same constant to every logit: e^{z_i} / Σ e^{z_j} is the same as e^{z_i - c} / Σ e^{z_j - c} for any constant c. We pick c = max(z).

Normalize. `e.sum()` is the sum of all exponentiated values. Dividing each by this sum converts the exponentials into a proper distribution that sums to 1. Every element is now in (0, 1), and they're proportional to the original exponentials, so the biggest logit gets the biggest probability, but everything is smooth (differentiable), so gradients can flow backward during training.

Softmax turns raw logits into a valid probability distribution. Here z = [2.0, 1.0, 0.1]: three scores for three classes.

Step by step:

- e^z for each: e^2.0 ≈ 7.39, e^1.0 ≈ 2.72, e^0.1 ≈ 1.11.
- Sum: 7.39 + 2.72 + 1.11 = 11.22.
- Divide each by the sum: [7.39/11.22, 2.72/11.22, 1.11/11.22] ≈ [0.659, 0.242, 0.099].

Each output is now in (0, 1) and they sum to 1.0, a valid probability distribution. The largest logit (2.0) becomes the largest probability (0.659), but every class keeps some mass. That's why it's "soft": a smoother, differentiable cousin of `argmax` that lets gradients flow during training.

The logits are arbitrary real numbers; softmax turns them into a distribution. In a real net you'd build the same thing with `torch.nn.Linear` layers and `torch.nn.ReLU`, but the arithmetic is exactly the numpy above.
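The max-subtraction trick matters in practice. The sketch below compares a naive softmax with the shifted version on deliberately large logits; the names `softmax_naive` and `softmax_stable` and the input values are assumptions for illustration.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)                    # overflows to inf for large z
    return e / e.sum()               # inf / inf gives nan

def softmax_stable(z):
    e = np.exp(z - z.max())          # largest exponent is now e^0 = 1
    return e / e.sum()

big = np.array([1000.0, 999.0, 998.0])

with np.errstate(over="ignore", invalid="ignore"):   # silence overflow warnings
    naive = softmax_naive(big)
stable = softmax_stable(big)

print(naive)    # [nan nan nan]
print(stable)   # same result as softmax of [2.0, 1.0, 0.0]
```

Because only the differences between logits matter, `[1000, 999, 998]` and `[2, 1, 0]` give the same distribution once the shift is applied.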
Embeddings place meaning in geometry: similar things sit close together. A query lands somewhere, and 'retrieval' is just finding its nearest neighbours — the engine under RAG and semantic search.
- ŷ = fₗ(…f₂(f₁(x))), where each fᵢ(v) = φ(Wᵢv + bᵢ). The forward pass is just evaluating a deeply nested function; backprop (training) is the chain rule applied to that nesting.
- […] (−∞, ∞). Softmax (multi-class) or sigmoid (binary) converts them; only then do they sum/squash to valid probabilities.
- […] φ, more parameters still collapse to one linear map.
- […] argmax picks the max; softmax is its smooth, differentiable cousin (which is why training can flow through it).
- […] W₂(W₁x) = (W₂W₁)x, just another single linear map, no more powerful than linear regression and able to draw only straight boundaries. A nonlinearity between layers prevents the collapse and lets depth represent genuinely complex (curved) functions.
- […] W,b, fixed during inference. Activations are a layer's outputs for one specific input, recomputed for every input. Logits are the final layer's raw pre-softmax scores, also per-input. So activations and logits change at inference; weights don't (unless you're training/fine-tuning).
- […] hₜ depends on hₜ₋₁, so no parallelism) and suffer vanishing/exploding gradients, so long-range information decays through the chain. Attention lets every position read every other position directly in one parallel step, with no single-state bottleneck, which scales on GPUs and preserves long-range dependencies. That's the transformer.

You're ready to move on when you can sketch a forward pass as matmul→activation→matmul, explain in one breath why the nonlinearity matters, and correctly use the words weights, activations, logits, embedding, parameter, and hyperparameter without hedging.
torch.nn docs (pytorch.org): how Linear, ReLU, GELU, and Softmax map to the math above.

Next: NLP and the Road to LLMs takes embeddings and the RNN dead-end and shows how language got modeled; then /transformers builds the architecture that powers everything in /agents and /rag.