Prereq · IC3

Neural Networks and Deep Learning

A neural network is just a stack of matrix multiplies glued together by nonlinear "switches" — once you see it as composed functions with learnable constants, "deep learning", embeddings, and the road to transformers stop being magic.

13 min read · 15 sections

Prerequisites: /ml-foundations/math-for-ml, /ml-foundations/how-models-learn, vectors and matrix multiply, what a parameter is

1. The one-sentence intuition

A neural network is a long pipeline of matmul → add bias → squash steps, where the matrices are the learnable constants and "training" is just tuning those constants so the pipeline maps inputs to the outputs you want. If you've ever written a data-transformation pipeline — parse, normalize, project, classify — you already have the shape; the twist is that instead of you writing each stage's logic, gradient descent fills in the numbers. The single non-obvious ingredient is a nonlinearity between stages: without it, the whole stack algebraically collapses into one matrix and can only draw straight lines. With it, stacking stages lets the network learn its own intermediate representations — edges, then shapes, then objects — which is the entire payoff of the word "deep".

2. Why a software engineer needs this

Everything in the later pillars is a neural network in a trench coat. A transformer (the engine behind every LLM you'll touch in /agents and /rag) is a specific arrangement of the exact pieces in this lesson: matrix multiplies, nonlinear activations, and a softmax at the end. When /finetuning talks about "updating the weights" or LoRA "adapting a subset of parameters", these are the weights and parameters defined here. When /rag/chunking-and-embeddings says retrieval "searches over embeddings", those embeddings are dense vectors produced by a neural network — and this lesson is where the word stops being a black box.

Interviews silently assume you can: explain why a deep net isn't just one big linear regression; say what a "forward pass" computes; distinguish a parameter (learned) from a hyperparameter (chosen); and not confuse logits with probabilities. Get these wrong and you signal "I've used the API but never looked inside." This lesson is the inside.

3. Build it up from scratch

Beginner explainer: New here?

The words first.

Neuron — one tiny computing unit. It takes some numbers in, mixes them, and puts one number out.
Weight — a number that says how much one input matters. A learned setting the network tunes during training; it does NOT change while you use the model.
Activation — the number a neuron actually outputs for the specific input you fed in right now. Weights are fixed; activations change with every input.
Weighted sum — multiply each input by its weight, add the results together, plus one extra bias number. This is the "mix" step.
Activation function (nonlinearity) — a simple curve applied to the weighted sum (e.g. ReLU, which just turns negatives into 0). Without it, stacking neurons would only ever produce straight-line math, so this is what lets networks learn bends and curves.
Layer — a row of many neurons all reading the same inputs at once. Stack layers so each layer's outputs feed the next.
Embedding — a list of numbers (a vector) that represents a thing like a word or image, so the network can do math on it. Similar things get similar number-lists.

Step by step.

Turn the raw input (text, pixels) into numbers, often an embedding.
Each neuron multiplies those numbers by its weights and sums them, adding the bias.
Push that sum through the activation function to get the neuron's activation.
Collect every neuron's activation — that is the layer's output.
Feed those outputs into the next layer; repeat through all layers.
The final layer's numbers are the answer (a prediction or score).

Remember this: a neural network is just layers of neurons doing "weighted sum then bend," passing numbers forward until the last layer is the answer — that journey is the forward pass.

### 3.1 The neuron: a weighted sum plus a switch

The atom is the neuron (the perceptron, in its original 1958 form). It takes an input vector x = [x₁, x₂, …, xₙ], multiplies each component by a weight wᵢ, sums them, adds a bias b, then passes the result through an activation function φ:

z = w·x + b = (w₁x₁ + w₂x₂ + … + wₙxₙ) + b

✎ The neuron equation — on real numbers

Every symbol in plain words: w is the weight vector (how much each input matters), x is the input vector (the numbers coming in), b is a single bias value (a threshold shifter), and z is the weighted sum (the "pre-activation" raw score).

Here's a concrete walk-through. Suppose you have two inputs: x = [2.0, 3.0]. The weights are w = [0.5, 0.2] and the bias is b = 1.0. Then:

Multiply each input by its weight: 0.5 * 2.0 = 1.0, 0.2 * 3.0 = 0.6.
Add them: 1.0 + 0.6 = 1.6.
Add the bias: 1.6 + 1.0 = 2.6.

So z = 2.6 — that's the raw score. Next, you squash it through an activation function (like ReLU) to get the neuron's actual output a. This equation is the entire neuron: a dot product (how aligned is your input to your weights) plus a shift (the bias), producing one number.

a = φ(z)

 
Symbol check: `x` is the input (given). `w` and `b` are **parameters** — the numbers learning will tune. `z` is the **pre-activation** (a raw score, also loosely called a logit at the final layer). `φ` is a fixed nonlinear function you choose. `a` is the neuron's **activation** — its output for *this* input.
 
Concrete micro-example. Let `x = [2.0, -1.0]`, `w = [0.5, -1.5]`, `b = 1.0`, and `φ = ReLU` (defined below):

z = 0.5·2.0 + (-1.5)·(-1.0) + 1.0 = 1.0 + 1.5 + 1.0 = 3.5
a = ReLU(3.5) = 3.5

 
That's the whole neuron: a dot product, a shift, a squash. The dot product `w·x` measures "how much does this input point in the direction my weights care about", the bias shifts the threshold, and `φ` decides whether and how strongly the neuron "fires".
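As a minimal sketch of the neuron above (the helper name `neuron` is an illustrative choice, not something this lesson defines), the micro-example can be reproduced in a few lines of numpy:

```python
import numpy as np

def relu(z):
    # ReLU: negatives become 0, positives pass through unchanged
    return np.maximum(0.0, z)

def neuron(x, w, b, phi=relu):
    """Hypothetical helper: return (z, a), the pre-activation and activation of one neuron."""
    z = float(np.dot(w, x) + b)   # weighted sum plus bias
    a = float(phi(z))             # squash through the activation function
    return z, a

# The micro-example from the text: x = [2.0, -1.0], w = [0.5, -1.5], b = 1.0
z, a = neuron(np.array([2.0, -1.0]), np.array([0.5, -1.5]), 1.0)
print(z, a)    # 3.5 3.5

# Same weights, input flipped (an extra case added for illustration)
z2, a2 = neuron(np.array([-2.0, 1.0]), np.array([0.5, -1.5]), 1.0)
print(z2, a2)  # -1.5 0.0
```

With the flipped input the weighted sum goes negative, so ReLU clips the output to 0: the weights are unchanged, but the activation is different because the input is different.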
 
### 3.2 Stacking into layers — the MLP — and why nonlinearity is non-negotiable
 
Put many neurons side by side, all reading the same input, and you get a **layer**. Stack `m` neurons reading an `n`-dim input and their weights become a single matrix `W` of shape `m×n`; the whole layer is one matrix multiply plus a bias vector plus an elementwise activation:

a = φ(W x + b) # W: m×n, x: n, b: m, a: m

 
Chain layers — feed one layer's output into the next — and you have a **multilayer perceptron (MLP)**, also called a fully-connected or feedforward network. With layers `1…L`:

a⁽¹⁾ = φ(W⁽¹⁾ x + b⁽¹⁾)
a⁽²⁾ = φ(W⁽²⁾ a⁽¹⁾ + b⁽²⁾)
…
ŷ = W⁽ᴸ⁾ a⁽ᴸ⁻¹⁾ + b⁽ᴸ⁾    # final layer often has no φ (or a softmax)

 
Running input forward through this chain to get `ŷ` is the **forward pass**. Notice what it actually is: a sequence of matrix multiplies (GEMMs, in systems terms) separated by cheap elementwise functions. The layers between input and output are **hidden layers**; their size (number of neurons) is the **width**, and the number of layers is the **depth**.
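The chain of layers can be sketched as a short loop. This is an illustration under assumptions: the helper `init_mlp`, the layer sizes `[4, 8, 8, 3]`, and the random weights are all made up for the example, and ReLU stands in for `φ`.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def init_mlp(sizes, seed=0):
    """Hypothetical helper: random (W, b) pairs for layer sizes such as [4, 8, 8, 3]."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(size=(m, n)), np.zeros(m))   # W is m x n, b has length m
            for n, m in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """Forward pass: phi(W a + b) for hidden layers, no activation on the last layer."""
    a = x
    for W, b in layers[:-1]:
        a = relu(W @ a + b)      # hidden layer: matmul, bias, nonlinearity
    W, b = layers[-1]
    return W @ a + b             # output layer: raw scores

sizes = [4, 8, 8, 3]             # input 4, two hidden layers of width 8, output 3
layers = init_mlp(sizes)
y_hat = forward(layers, np.ones(4))

n_params = sum(W.size + b.size for W, b in layers)
print(y_hat.shape)               # (3,)
print(n_params)                  # 139 = (4*8+8) + (8*8+8) + (8*3+3)
```

Width and depth here are just the numbers in `sizes`: changing them changes the matrix shapes and the parameter count, but the loop stays the same.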
 
Now the crucial part. **Why must `φ` be nonlinear?** Suppose you dropped it (φ = identity). Then two stacked layers compute:

W⁽²⁾(W⁽¹⁾x + b⁽¹⁾) + b⁽²⁾ = (W⁽²⁾W⁽¹⁾)x + (W⁽²⁾b⁽¹⁾ + b⁽²⁾) = W' x + b'

✎ Why stacking linear layers collapses — on real numbers

Walk through what happens when you remove the nonlinearity and chain two matrix multiplies. Suppose x = 1.0 (scalar for clarity), and layer 1 has W⁽¹⁾ = 2.0, b⁽¹⁾ = 0.5, and layer 2 has W⁽²⁾ = 3.0, b⁽²⁾ = 0.1.

If there's no activation between them (no φ):

Layer 1 output: W⁽¹⁾ * x + b⁽¹⁾ = 2.0 * 1.0 + 0.5 = 2.5.
Layer 2 input is now 2.5, so layer 2 output: W⁽²⁾ * 2.5 + b⁽²⁾ = 3.0 * 2.5 + 0.1 = 7.6.

But look what you can combine: W⁽²⁾ * (W⁽¹⁾ * x + b⁽¹⁾) + b⁽²⁾ = (W⁽²⁾ * W⁽¹⁾) * x + (W⁽²⁾ * b⁽¹⁾ + b⁽²⁾) = (3.0 * 2.0) * x + (3.0 * 0.5 + 0.1) = 6.0 * x + 1.6. At x = 1.0 this gives 6.0 + 1.6 = 7.6, the same answer as the two-step computation. That's just one weight (6.0) and one bias (1.6): two layers became one. A hundred linear layers would do the same. The nonlinearity φ between them is what breaks this algebraic collapse and actually lets depth buy you expressive power.

 
The product of two matrices is just another matrix. A hundred linear layers collapse into a *single* linear layer `W'x + b'`, no more expressive than plain [linear regression](/ml-foundations/what-is-ml) and able to carve the input space only with straight hyperplanes. The nonlinearity between layers is exactly what stops this collapse and lets depth buy you genuinely more expressive functions. (The formal version is the **universal approximation theorem**: an MLP with one hidden layer and a suitable nonlinearity can approximate any continuous function on a closed, bounded input region to any desired accuracy, given enough width. The theorem says such weights exist, not that training will find them.)
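The collapse is easy to check numerically. The sketch below uses small hand-picked matrices (assumed values, chosen only so the arithmetic is easy to follow) and compares two stacked linear layers against the single merged layer `W'x + b'`, then repeats the computation with a ReLU in between.

```python
import numpy as np

# Hand-picked toy values (assumptions for illustration)
W1 = np.array([[1.0, -1.0],
               [2.0,  0.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.5])
x = np.array([1.0, 2.0])

# Two stacked layers with NO activation in between
two_linear = W2 @ (W1 @ x + b1) + b2

# The single equivalent layer: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_linear = W_prime @ x + b_prime

print(two_linear, one_linear)        # [0.5] [0.5]

# Put a ReLU between the layers and the merged layer no longer matches
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2
print(with_relu)                     # [1.5]
```

The first hidden value is -1 before the activation; ReLU zeroes it, and that single clipped value is enough to make the network differ from the merged linear layer.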
 
### 3.3 The activation zoo
 
Different `φ`s for different jobs. You only need to recognize four or five.
 
| Activation | Formula | Output range | Where it's used |
|---|---|---|---|
| **ReLU** | `max(0, z)` | `[0, ∞)` | default for hidden layers (CNNs, MLPs); cheap, trains well |
| **GELU** | `z · Φ(z)` (Φ = Gaussian CDF) | `≈(-0.17, ∞)` | hidden layers in **transformers**; a smooth ReLU |
| **Sigmoid** | `1 / (1 + e⁻ᶻ)` | `(0, 1)` | a single binary probability; "gates" in LSTMs |
| **Tanh** | `(eᶻ−e⁻ᶻ)/(eᶻ+e⁻ᶻ)` | `(−1, 1)` | older RNN hidden states |
| **Softmax** | `e^{zᵢ} / Σⱼ e^{zⱼ}` | `(0,1)`, sums to 1 | the **output** layer for multi-class problems |
 
ReLU dominates hidden layers because it's trivially cheap and doesn't "saturate" for positive inputs — sigmoid/tanh flatten out at the extremes, so their gradient (the [slope that training follows](/ml-foundations/how-models-learn)) shrinks toward zero and learning stalls; this is the **vanishing gradient** problem. **Softmax** is special: it takes a *vector* of raw scores and turns it into a probability distribution — every element in `(0,1)`, all summing to 1. An LLM predicting the next token runs a softmax over ~100k vocabulary scores to get "probability of each possible next token". The raw scores feeding into that softmax (or sigmoid) are the **logits** — unnormalized, can be any real number, not yet probabilities.
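A rough numpy sketch of the functions in the table, plus a look at saturation. Two assumptions to flag: GELU is written with the exact Gaussian CDF via `math.erf` (libraries often use a faster approximation), and the test inputs are arbitrary.

```python
import math
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def gelu(z):
    # z * Phi(z), with Phi the standard normal CDF expressed through the error function
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return z * cdf

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])   # arbitrary test inputs
print(relu(z))                   # negatives clipped to 0
print(np.round(gelu(z), 4))      # smooth, dips slightly below 0 for small negatives
print(np.round(np.tanh(z), 4))   # squashed into (-1, 1)
print(softmax(z).sum())          # 1.0 up to floating-point rounding

# Saturation: the slope of sigmoid is sigmoid(z) * (1 - sigmoid(z))
slope = sigmoid(z) * (1.0 - sigmoid(z))
print(slope)                     # peaks at 0.25 for z = 0, about 4.5e-5 at z = ±10
```

The last line is the vanishing-gradient story in miniature: at the extremes the sigmoid's slope is thousands of times smaller than at the center, so very little learning signal gets through.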
 
### 3.4 What "deep" actually buys you: representation learning
 
Why stack *many* layers instead of one very wide one? Because depth lets the network build **hierarchical representations** — each layer learns features composed from the previous layer's features. In a vision network the pattern is famous and literally visualizable:
 
- **Early layers** detect edges and color gradients.
- **Middle layers** combine edges into textures, corners, simple shapes.
- **Late layers** combine those into object parts and whole concepts (a wheel, a face, a cat).
 
Nobody programmed "edge detector" — gradient descent discovered that edges are a useful intermediate representation for the final task. This is the headline idea of deep learning, often called **representation learning**: instead of you hand-engineering features (the pre-2012 ML workflow), the network *learns* the features. For text, the same hierarchy shows up as lower layers capturing local syntax and higher layers capturing meaning and reference. This is why "deep" and not just "wide": a wide-shallow net can memorize but struggles to *compose*, while depth lets later layers reuse and recombine what earlier ones found.
 
### 3.5 Embeddings: from one-hot to learned dense vectors
 
Networks eat numbers, but much of the world is discrete symbols — words, user IDs, product SKUs. The naive encoding is **one-hot**: a vocabulary of 50,000 words becomes a 50,000-long vector that is all zeros except a single 1. This is huge, wasteful, and — worse — *geometrically meaningless*: "cat" and "dog" are exactly as far apart as "cat" and "tractor" (every pair is orthogonal).
 
An **embedding** fixes this. It's a learned lookup table: an `embedding matrix E` of shape `vocab × d` (say `50000 × 256`). Token `i`'s embedding is just **row `i` of `E`** — a **dense** `d`-dimensional vector of real numbers. (Multiplying a one-hot vector by `E` selects exactly that row, which is why an embedding layer is literally a row lookup.) These rows are parameters, tuned by training, so the network arranges them in space such that *similar things land near each other*: after training, "cat" and "dog" end up close, and famously `vec("king") − vec("man") + vec("woman") ≈ vec("queen")`.
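Both claims above can be checked directly: one-hot vectors are all equally far apart, and multiplying a one-hot vector by `E` is the same as reading a row. In this sketch the three-word vocabulary is hypothetical and `E` is filled with random numbers as a stand-in for trained values.

```python
import numpy as np

vocab = ["cat", "dog", "tractor"]        # toy vocabulary (hypothetical)
d = 4                                    # embedding dimension (arbitrary)
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d))     # vocab x d; random stand-in for learned values

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

# One-hot: every pair of distinct tokens is orthogonal and equally far apart
cat, dog, tractor = (one_hot(i, len(vocab)) for i in range(3))
print(cat @ dog, cat @ tractor)                                    # 0.0 0.0
print(np.linalg.norm(cat - dog), np.linalg.norm(cat - tractor))    # both sqrt(2)

# Embedding lookup: one-hot times E selects exactly row i of E
i = vocab.index("dog")
print(np.allclose(one_hot(i, len(vocab)) @ E, E[i]))               # True
```

In practice nobody performs the multiply; frameworks index the row directly, which is why the layer is described as a lookup table.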
 
This is the direct bridge to retrieval. When [/rag/chunking-and-embeddings](/rag/chunking-and-embeddings) says a vector database "searches over embeddings", it means: encode every document chunk into a dense vector, encode the query the same way, and return the chunks whose vectors are closest (by cosine similarity / dot product). The whole premise of semantic search is that a neural net learned to put related meanings near each other in vector space. An embedding is a *one-hot turned into geography*.
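A toy version of that retrieval step, assuming the embeddings already exist. The chunk texts, the 3-dimensional vectors, and the query vector below are all hand-written placeholders; a real system would get them from an embedding model and use far more dimensions.

```python
import numpy as np

# Hypothetical, hand-written "embeddings" (a real encoder would produce these)
chunk_texts = ["how to feed a cat", "dog training basics", "tractor engine repair"]
chunk_vecs = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [0.1, 0.2, 0.9],
])
query_vec = np.array([1.0, 0.7, 0.0])    # pretend this encodes "kitten food"

def cosine_similarity(matrix, v):
    """Cosine similarity between each row of `matrix` and the vector `v`."""
    return (matrix @ v) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(v))

scores = cosine_similarity(chunk_vecs, query_vec)
ranking = np.argsort(-scores)            # indices sorted from most to least similar

for idx in ranking:
    print(round(float(scores[idx]), 3), chunk_texts[idx])
```

With these placeholder vectors the cat chunk ranks first, the dog chunk second, and the tractor chunk last. A vector database performs the same ranking, only with approximate indexes so it stays fast over millions of chunks.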
 
### 3.6 Two famous architectures, and why we needed attention
 
The MLP treats inputs as a flat bag of numbers. Two specialized designs exploit *structure*, and their limits set up the jump to transformers.
 
**CNNs (Convolutional Neural Networks): for grids like images.** Instead of every neuron reading every pixel (which would mean billions of weights for a photo), a CNN slides a small **filter** (a tiny weight matrix, say 3×3) across the whole image, computing the same dot product at every position. This is **weight sharing**: the same parameters are reused everywhere, which (a) slashes parameter count and (b) bakes in **translation equivariance** (often loosely called translation invariance): shift the cat and the detector's response shifts with it, so a cat detector works wherever the cat is. Stack convolutional layers and you get exactly the edges→parts→objects hierarchy from §3.4. CNNs (AlexNet, 2012) kicked off the deep-learning era in vision.
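A bare-bones sketch of the sliding filter, written with explicit loops for readability rather than speed. The helper `conv2d_valid`, the 6×6 image, and the hand-written vertical-edge kernel are assumptions for illustration; in a trained CNN the kernel values are learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one kernel over the image (no padding, stride 1).
    As in most deep-learning libraries, this is cross-correlation (kernel not flipped)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The SAME 9 weights are applied at every position: weight sharing
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hand-written vertical-edge detector (a trained network would learn its own)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

image = np.zeros((6, 6))
image[:, 3:] = 1.0               # dark left, bright right: edge between columns 2 and 3
shifted = np.zeros((6, 6))
shifted[:, 4:] = 1.0             # same edge, moved one pixel to the right

response = conv2d_valid(image, kernel)
response_shifted = conv2d_valid(shifted, kernel)
print(response[0])               # [0. 3. 3. 0.]
print(response_shifted[0])       # [0. 0. 3. 3.]

dense_weights = image.size * response.size   # fully connected 36 -> 16 would need this many
print(kernel.size, dense_weights)            # 9 576
```

Moving the edge one pixel moves the response one pixel, with no new parameters needed, and the filter uses 9 weights where a fully connected layer of the same input and output size would use 576.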
 
**RNNs (Recurrent Neural Networks) — for sequences like text.** An RNN reads tokens one at a time, maintaining a **hidden state** that carries a summary of everything seen so far:

hₜ = φ(Wₓ xₜ + Wₕ hₜ₋₁ + b) # new state depends on current input and previous state

 
Elegant, but it struggled badly on long sequences for two reasons. **First, it's inherently sequential** — `hₜ` needs `hₜ₋₁` — so you can't parallelize across the sequence, which is brutal on modern GPUs. **Second, vanishing/exploding gradients**: information from far-back tokens has to survive being multiplied through many steps, and the signal decays (or blows up), so the network effectively forgets long-range context. LSTMs and GRUs (RNNs with explicit "gates", using sigmoids to decide what to keep/forget) mitigated this but never fully solved it.
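Both problems are visible in a few lines. The sketch below assumes tanh for `φ` and uses random, untrained weights with made-up sizes. The final loop is a deliberately simplified scalar picture of the gradient issue: real RNN gradients involve products of matrices, not a single repeated number.

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b, h0):
    """Run a simple RNN over a sequence and return every hidden state."""
    h = h0
    states = []
    for x_t in xs:                           # step t needs h from step t-1: no parallelism
        h = np.tanh(Wx @ x_t + Wh @ h + b)   # the recurrence from the text, with phi = tanh
        states.append(h)
    return states

rng = np.random.default_rng(0)
d_in, d_h, T = 3, 4, 5                       # arbitrary sizes for illustration
Wx = rng.normal(size=(d_h, d_in)) * 0.5
Wh = rng.normal(size=(d_h, d_h)) * 0.5
b = np.zeros(d_h)
xs = rng.normal(size=(T, d_in))              # a made-up sequence of 5 input vectors

states = rnn_forward(xs, Wx, Wh, b, np.zeros(d_h))
print(len(states), states[-1].shape)         # 5 (4,)

# Simplified scalar picture: a signal scaled by the same factor at each of 100 steps
for factor in (0.9, 1.1):
    print(factor, factor ** 100)             # about 2.7e-05 (vanishes), about 1.4e+04 (explodes)
```

A per-step factor only slightly below or above 1 is enough to erase or blow up a signal over a hundred steps, which is the intuition behind the network forgetting long-range context.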
 
The fix was **attention**: let every position look *directly* at every other position in one parallel step, with no information bottleneck through a single hidden state. That idea, scaled up, is the **transformer** — covered in [/transformers](/transformers), and the foundation of every modern LLM. RNNs are why that lesson exists.
 
### 3.7 The vocabulary interviewers assume
 
| Term | What it is | When it changes |
|---|---|---|
| **Parameters** | the **weights `W`** and **biases `b`** the model learns | every training step |
| **Hyperparameters** | knobs *you* set before/around training: learning rate, depth, width, activation choice, batch size | you tune them by hand / search |
| **Weights** | the parameter values themselves | frozen during inference |
| **Activations** | a layer's outputs for one specific input | every input (ephemeral) |
| **Logits** | the final layer's raw scores, before softmax/sigmoid | every input |
 
One-line mnemonic: **parameters are learned, hyperparameters are chosen; weights are fixed at inference, activations and logits are recomputed per input.**

◇ Live illustration: A neural network's forward pass

Inputs flow left to right through weighted connections and nonlinear neurons; the activation wave is one forward pass. Training nudges the weights so the output matches the target.

4. See it in code

A two-layer MLP that computes XOR — the canonical task a linear model provably cannot do, which is precisely why it needs a hidden layer with a nonlinearity. Weights here are hand-picked so we can read them; normally training finds them.

import numpy as np
 
def relu(z):
    return np.maximum(0.0, z)           # the nonlinearity: kills negatives, keeps positives
 
# Layer 1: 2 inputs -> 2 hidden units.  W1 is shape (2, 2), b1 is shape (2,)
W1 = np.array([[1.0, 1.0],              # hidden unit 1 fires on "at least one input on"
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])              # unit 2's bias makes it fire only when BOTH are on
 
# Layer 2: 2 hidden -> 1 output.  W2 is shape (1, 2)
W2 = np.array([[1.0, -2.0]])            # output = (input on?) minus 2*(both on?)
b2 = np.array([0.0])
 
def forward(x):
    h = relu(W1 @ x + b1)               # hidden activations: matmul + bias + nonlinearity
    logit = W2 @ h + b2                 # output layer: raw score, no activation
    return h, logit
 
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h, logit = forward(np.array(x, dtype=float))
    print(x, "hidden:", h, "-> xor:", round(float(logit[0])))

Line by line: relu is our φ. W1 @ x + b1 is the §3.2 layer equation (@ is matrix-multiply), and wrapping it in relu is what makes the network nonlinear. The forward function is the forward pass: two matmuls separated by one activation. Running it prints one line per input, and the xor values read 0, 1, 1, 0, which is correct XOR. Delete the relu (make it linear) and no choice of W1, b1, W2, b2 can ever reproduce XOR; that's §3.2's collapse, made concrete.

✎ XOR forward pass, section by section

Section 1: Setup (relu, W1, b1, W2, b2). Define relu (the nonlinearity), then hand-craft the hidden layer's weights W1 and bias b1, followed by the output layer's weights W2 and bias b2. These values are carefully chosen (not learned) so we can see the arithmetic clearly.

Section 2: The forward pass (the forward function). forward(x) runs two stages:

h = relu(W1 @ x + b1): compute the hidden layer with a matrix-vector multiply (W1 @ x), add the bias, then apply relu (which zeros out negatives). This is exactly the neuron equation from §3.1, applied to both hidden neurons at once.
logit = W2 @ h + b2: feed the hidden layer's output into the output layer (another matmul and bias), with no activation at the end, so logit is just a raw score.

Section 3: Run and verify (the for loop). For each of the four possible XOR inputs ([0,0], [0,1], [1,0], [1,1]), compute the forward pass and print the result. Because relu provides the nonlinearity, the network correctly outputs 0, 1, 1, 0. Remove relu (make it the identity function) and no choice of W1, b1, W2, b2 can solve XOR: that is the algebraic collapse from §3.2 in action.
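One way to see the linear failure concretely is to fit the best possible linear model to the XOR table. This sketch uses ordinary least squares via `np.linalg.lstsq`, which is a choice made here for illustration and not part of the lesson's code.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])            # XOR targets

# Best linear model w·x + b in the least-squares sense
A = np.hstack([X, np.ones((4, 1))])           # append a column of 1s so the bias is fitted too
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # coef = [w1, w2, b]
pred = A @ coef

print(np.round(coef, 3))   # weights approximately 0, bias approximately 0.5
print(np.round(pred, 3))   # approximately 0.5 for every input
```

The best a linear model can do is ignore both inputs and answer 0.5 every time, which carries no information about XOR. The hidden layer with relu is what makes the exact 0, 1, 1, 0 outputs reachable.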

To see logits become probabilities, here is softmax on raw scores:

def softmax(z):
    e = np.exp(z - z.max())             # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])      # raw output-layer scores for 3 classes
print(softmax(logits))                  # -> approximately [0.659, 0.242, 0.099], sums to 1.0

✎ Softmax implementation, section by section

First line of the body: subtract the max for stability. Exponentials grow very fast. If z contains large numbers (say, 1000), e^1000 overflows to infinity in floating point. The trick z - z.max() shifts everything down so the largest element becomes 0 (and e^0 = 1). This doesn't change the final probabilities because softmax is shift-invariant: e^{z_i} / Σ e^{z_j} is the same as e^{z_i - c} / Σ e^{z_j - c} for any constant c, since the common factor e^{-c} cancels between numerator and denominator. We pick c = max(z).

Return line: normalize. e.sum() is the sum of all exponentiated values. Dividing each by this sum converts the exponentials into a proper distribution that sums to 1. Every element is now in (0, 1), and they're proportional to the original exponentials, so the biggest logit gets the biggest probability. Everything is smooth (differentiable), so gradients can flow backward during training.

✎ Softmax on real numbers

Softmax turns raw logits into a valid probability distribution. Here z = [2.0, 1.0, 0.1]: three scores for three classes.

Step by step:

Compute e^z for each: e^2.0 ≈ 7.39, e^1.0 ≈ 2.72, e^0.1 ≈ 1.11.
Sum them: 7.39 + 2.72 + 1.11 = 11.22.
Divide each by the sum: [7.39/11.22, 2.72/11.22, 1.11/11.22] ≈ [0.659, 0.242, 0.099].

Each output is now in (0, 1) and they sum to 1.0, a valid probability distribution. The largest logit (2.0) becomes the largest probability (0.659), but every class keeps some mass. That's why it's "soft": a smoother, differentiable cousin of argmax that lets gradients flow during training.

 
The logits are arbitrary real numbers; softmax turns them into a distribution. The largest logit gets the largest probability — but every class keeps some mass (it's *soft*, not `argmax`). In a real net you'd build the same thing with `torch.nn.Linear` layers and `torch.nn.ReLU`, but the arithmetic is exactly the numpy above.
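The binary case follows the same pattern: a single logit passed through a sigmoid gives the same probability as a two-class softmax over `[z, 0]`. The sketch below checks that identity and the argmax point; the logit values are arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([3.0, -1.0, 0.5])   # arbitrary raw scores: can be negative, need not sum to 1
probs = softmax(logits)

print(logits.sum())                                     # 2.5
print(probs.sum())                                      # 1.0 up to floating-point rounding
print(int(np.argmax(logits)), int(np.argmax(probs)))    # 0 0  (softmax keeps the ordering)
print(bool(probs.min() > 0))                            # True (no class is driven to exactly 0)

# Binary case: sigmoid(z) equals the first entry of softmax([z, 0])
z = 1.7
print(bool(np.isclose(sigmoid(z), softmax(np.array([z, 0.0]))[0])))   # True
```

Softmax never changes which class wins, only how the scores are expressed, so taking `argmax` of the logits and of the probabilities gives the same answer.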

◇ Live illustration: The embedding space

Embeddings place meaning in geometry: similar things sit close together. A query lands somewhere, and 'retrieval' is just finding its nearest neighbours — the engine under RAG and semantic search.

5. Mental models & SWE analogies

A network is function composition. ŷ = fₗ(…f₂(f₁(x))), where each fᵢ(v) = φ(Wᵢv + bᵢ). The forward pass is just evaluating a deeply nested function; backprop (training) is the chain rule applied to that nesting.
Weights are constants the compiler fills in. You write the architecture (the function shapes); gradient descent acts like an optimizing compiler that picks the constant values to minimize loss. You don't write the constants; you write the structure and the objective.
A gradient ≈ a profiler's attribution. Just as a profiler tells you which line to change to cut latency, a gradient tells you which weight to nudge (and which way) to cut loss. Training follows it downhill.
Embeddings are a hash that preserves meaning. A normal hash scatters similar inputs randomly; an embedding does the opposite — it maps similar inputs to nearby vectors, which is exactly what makes nearest-neighbor search semantic.
Activations are stack frames; weights are the source code. Weights are the static program (same every call); activations are the per-call locals — created fresh for each input, discarded after. This is literally why inference memory scales with batch size and sequence length, not just model size.

6. Common confusions

"Logits are probabilities." No — logits are unnormalized raw scores in (−∞, ∞). Softmax (multi-class) or sigmoid (binary) converts them; only then do they sum/squash to valid probabilities.
"Activation functions add capacity." Their job is nonlinearity, not capacity. Capacity (how complex a function you can fit) comes from parameters — width and depth. Without φ, more parameters still collapse to one linear map.
"Embeddings come from a separate algorithm." They're rows of a learned matrix (or the output of a network). Same gradient descent, same parameters — nothing exotic.
"Deeper is always better." Depth enables hierarchy but costs trainability (vanishing gradients) and data; naive deep nets overfit or fail to train without tricks (residual connections, normalization, regularization).
"Softmax picks the maximum." It produces a soft distribution — every class keeps nonzero probability. argmax picks the max; softmax is its smooth, differentiable cousin (which is why training can flow through it).
"A neuron models a brain cell." It's a loose metaphor. A neuron is a weighted sum plus a nonlinearity — pure linear algebra, no biology required.
"Parameters and hyperparameters are the same thing." Parameters are learned by the optimizer; hyperparameters (learning rate, layer count, batch size) are set by you and govern how learning happens.

7. Check yourself

[Prereq] Why do neural networks need nonlinear activations? Because stacking linear layers collapses: W₂(W₁x) = (W₂W₁)x, just another single linear map — no more powerful than linear regression, able to draw only straight boundaries. A nonlinearity between layers prevents the collapse and lets depth represent genuinely complex (curved) functions.

[Prereq] What's the difference between parameters and hyperparameters? Parameters are the weights and biases the optimizer learns from data during training. Hyperparameters are the settings you choose around training — learning rate, number/width of layers, activation, batch size — that control how learning happens. Parameters are found; hyperparameters are chosen (often by search).

[IC3] Define weights vs activations vs logits. Which change at inference time? Weights are the learned parameters W,b — fixed during inference. Activations are a layer's outputs for one specific input — recomputed for every input. Logits are the final layer's raw pre-softmax scores — also per-input. So activations and logits change at inference; weights don't (unless you're training/fine-tuning).

[IC3] What does "deep" buy you that "wide" doesn't? Hierarchical representation learning: each layer composes features from the previous one (edges → parts → objects), so later layers reuse earlier abstractions. A shallow-wide net can fit/memorize but composes poorly; depth is what lets the network build and recombine intermediate concepts.

[IC4] Why did RNNs struggle on long sequences, and how does attention fix it? RNNs are sequential (state hₜ depends on hₜ₋₁, so no parallelism) and suffer vanishing/exploding gradients, so long-range information decays through the chain. Attention lets every position read every other position directly in one parallel step — no single-state bottleneck — which scales on GPUs and preserves long-range dependencies. That's the transformer.

You're ready to move on when you can sketch a forward pass as matmul→activation→matmul, explain in one breath why the nonlinearity matters, and correctly use the words weights, activations, logits, embedding, parameter, and hyperparameter without hedging.

8. Go deeper

Stanford CS231n — Neural Networks notes (cs231n.github.io/neural-networks-1): the canonical from-scratch treatment of neurons, layers, and activations.
Dive into Deep Learning — Multilayer Perceptrons (d2l.ai): MLPs with runnable code; the best free interactive textbook.
3Blue1Brown — Neural Networks series (3blue1brown.com): the visual intuition for what a network and its weights are doing.
Goodfellow, Bengio & Courville — Deep Learning (deeplearningbook.org): the standard reference for the theory (representations, depth, architectures).
Stanford CS224n (web.stanford.edu/class/cs224n): word embeddings, RNNs, and the road to attention, for the NLP angle.
PyTorch torch.nn docs (pytorch.org): how Linear, ReLU, GELU, and Softmax map to the math above.

Next: NLP and the Road to LLMs takes embeddings and the RNN dead-end and shows how language got modeled — then /transformers builds the architecture that powers everything in /agents and /rag.
