Prereq · IC3

How Models Learn

Training is a loop that nudges millions of numbers downhill on an error surface — forward, loss, backward, step, repeat — and once you can picture that loop, words like gradient, learning rate, and AdamW stop being magic.

13 min read · 15 sections

Prerequisites: vectors & derivatives (see math-for-ml), what a model is (see what-is-ml)

Runnable: ai-eng-wiki/examples/ml/train_loop.py

1. The one-sentence intuition

"Training" is just an optimization loop: you measure how wrong the model is with a single number (the loss), figure out which direction to nudge each of the model's numbers (its parameters) to make that number smaller, take a small step in that direction, and repeat millions of times until the loss stops dropping. The direction is the gradient; stepping against it is gradient descent.

If you've ever written a feedback controller, a hill-climbing search, or a Newton-Raphson root-finder, you already have the mental model: define an objective, compute its slope, step toward the optimum, loop until convergence. A neural network is the same idea with a few billion knobs instead of one, and a clever trick (backpropagation) for computing the slope of all of them in a single pass.

2. Why a software engineer needs this

You will rarely write a training loop from scratch on the job — but every concept in this lesson is load-bearing vocabulary for the rest of AI engineering, and interviews assume you own it cold.

Fine-tuning (/finetuning) is this loop, run on your data with a pretrained model as the starting point. LoRA, QLoRA, DPO, and GRPO are all variations on "what loss do we descend, and which parameters do we update." You cannot reason about them without the loop.
Loss and gradients show up constantly in discourse: "the model collapsed," "loss spiked," "we lowered the learning rate," "gradients exploded," "we warmed up then decayed the LR." These are not metaphors — they're literal events in the loop you're about to understand.
Evals (/evals) exist precisely because the training loss is not the thing you care about. Knowing the difference between the loss you optimize and the metric you ship is half of practical ML.
Transformers (/transformers) are "just" a particular architecture plugged into this exact loop. Once the loop is second nature, the transformer is the only new thing left to learn.

Interviewers use this topic as a filter. If you can explain why cross-entropy and not MSE for classification, what the learning rate trades off, and what backprop actually computes, you sound like someone who has trained a model. If you can't, no amount of API experience hides it.

3. Build it up from scratch

Beginner explainer: New here? The words first

The words first.

Weight — one of the model's adjustable numbers (it may have millions); learning means tweaking these.
Loss — a single number that scores how wrong the model's prediction was. Lower is better; 0 would be perfect.
Forward pass — feeding an input through the model to get a prediction, then computing the loss.
Gradient — for each weight, a number saying which direction (up or down) and how strongly to change it to reduce the loss.
Backward pass (backprop) — the calculation that produces all those gradients in one sweep, working from the loss back through the model.
Learning rate — how big a step you take when nudging weights. Too big overshoots; too small crawls.
Batch — the small group of examples processed together before each weight update (e.g. 32 images at once).
Epoch — one full pass through the entire training dataset (many batches).

Step by step.

Grab a batch of examples.
Forward: run them through the model to get predictions.
Loss: compare predictions to the right answers, get one wrongness number.
Backward: compute the gradient for every weight.
Step: nudge each weight a little, opposite its gradient, scaled by the learning rate.
Repeat with the next batch until the data runs out — that finishes one epoch.
Run many epochs; the loss trends downward as the model improves.

Remember this: Training just repeats one cycle — predict, measure how wrong, find which way each weight should move, take a small step — millions of times.

3.1 The loss function: turning "wrong" into one number

A model is a function with tunable parameters: it takes an input x and produces a prediction ŷ ("y-hat"). To improve it, we first need to score it. A loss function maps a prediction and the true answer y to a single non-negative number: 0 means perfect, larger means worse. Training = making the average loss over your data as small as possible.

Two losses cover almost everything you'll meet.

Regression → Mean Squared Error (MSE). When the target is a continuous number (price, temperature), penalize the squared gap between prediction and truth:

\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2

✎ Mean Squared Error — on real numbers

Every symbol: n is the count of examples, i picks out one example, ŷ_i is the model's prediction for example i, and y_i is the true value. You subtract truth from prediction, square it (to make every error positive), then average across all examples.

Concrete example: Suppose you have three examples with true values y = [2, 4, 6] and the model predicts ŷ = [2.5, 3.5, 6.2]. The errors are [0.5, -0.5, 0.2]. Squared: [0.25, 0.25, 0.04]. Average: (0.25 + 0.25 + 0.04) / 3 ≈ 0.18. That's the MSE — a single number measuring how far off the predictions were on average. Bigger errors hurt more: being off by 2 costs 4 in the sum, being off by 1 costs only 1.

This loss squashes all wrongness into one scalar that the optimizer can drive toward zero.

Squaring makes every error positive and punishes big misses far more than small ones (being off by 4 costs 16, being off by 2 costs 4). In short, MSE measures distance.
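The MSE arithmetic from the concrete example can be checked in a few lines of NumPy. This is a minimal sketch that reuses the three example values from the text; the variable names are illustrative only.

```python
import numpy as np

y_true = np.array([2.0, 4.0, 6.0])   # true values from the example
y_pred = np.array([2.5, 3.5, 6.2])   # model predictions from the example

errors = y_pred - y_true             # prediction minus truth, per example
mse = np.mean(errors ** 2)           # square each error, then average

print(f"errors = {np.round(errors, 2)}, MSE = {mse:.2f}")
```

Swapping in a prediction that is off by 2 instead of 0.5 shows how quickly the squared term dominates the average.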

Classification → Cross-Entropy. When the target is a category (cat/dog, or one of 50,000 possible next tokens), the model outputs a probability for each class, and we want to reward it for putting high probability on the correct class. Cross-entropy for a single example is:

\text{CE} = -\log p_{\text{correct}}

✎ Cross-entropy for classification — on real numbers

Every symbol: p_correct is the probability the model assigned to the true class, and log is the natural logarithm (base e).

Concrete example: The model outputs probabilities for three classes; suppose the true class is class 2, and the model assigned p_2 = 0.8. Then CE = -log(0.8) ≈ -(-0.223) ≈ 0.22 — a small loss because the model was confident and correct. Now suppose the model were very wrong: p_correct = 0.01. Then CE = -log(0.01) ≈ 4.6 — huge penalty. If the model outputs p_correct = 0.5 (uncertain), CE = -log(0.5) ≈ 0.69 — medium loss. Notice: the worse the model is (lower probability on the true answer), the larger the loss. And the gradient of cross-entropy stays large exactly when the model is wrong — perfect for learning.

Read the loss as surprise: how shocked the model is by the right answer. If the model is confident and right (p_correct = 0.99), the loss is -log(0.99) ≈ 0.01, which is tiny. If it is confident and wrong (p_correct = 0.01 on the true class), the loss is -log(0.01) ≈ 4.6, which is huge.
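To get a feel for how the penalty grows as the true-class probability shrinks, the sketch below evaluates the single-example formula at the probabilities used in the text. The helper name `cross_entropy` is made up for this illustration.

```python
import math

def cross_entropy(p_correct: float) -> float:
    """Loss for one example: negative natural log of the true-class probability."""
    return -math.log(p_correct)

for p in (0.99, 0.8, 0.5, 0.01):
    print(f"p_correct = {p:<4}  CE = {cross_entropy(p):.2f}")
```

The loss is close to 0 near p_correct = 1 and climbs steeply as p_correct approaches 0.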

Two terms you'll hear constantly live right here. Logits are the raw, unbounded scores the network outputs per class (e.g. [2.0, 1.0, 0.1]). To turn logits into probabilities that sum to 1, you apply softmax: exponentiate each, then normalize.

p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}

✎ Softmax — on real numbers

Every symbol: z_k is the logit (raw model output) for class k, e is Euler's constant (~2.718), and the denominator sums the exponentials of all logits to normalize the distribution.

Concrete example: Logits from the model are [2.0, 1.0, 0.1] (three classes). First, exponentiate: e^2.0 ≈ 7.39, e^1.0 ≈ 2.72, e^0.1 ≈ 1.11. Sum them: 7.39 + 2.72 + 1.11 ≈ 11.22. Now divide each by the sum: p_1 = 7.39 / 11.22 ≈ 0.66, p_2 = 2.72 / 11.22 ≈ 0.24, p_3 = 1.11 / 11.22 ≈ 0.10. Notice the probabilities sum to 1.0 and the biggest logit gets the biggest probability, with smooth attenuation. This transformation turns raw scores into a valid probability distribution that cross-entropy can consume.

Carrying the micro-example through to the loss: logits [2.0, 1.0, 0.1] give probabilities [0.66, 0.24, 0.10]. If the true class is the first, cross-entropy is -log(0.66) ≈ 0.42. If the true class were the last, it would be -log(0.10) ≈ 2.3. The same prediction is "good" or "bad" depending only on which class was correct.
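The softmax-then-cross-entropy pipeline for the logits [2.0, 1.0, 0.1] can be sketched as follows. One detail here is an addition not covered in the text: subtracting the largest logit before exponentiating, a common trick that avoids overflow without changing the result.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # stability trick (an addition to the text); output is unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

ce_if_first = -np.log(probs[0])   # loss if the true class is the first one
ce_if_last = -np.log(probs[2])    # loss if the true class is the last one

print("probabilities:", np.round(probs, 2))
print(f"CE if class 1 is correct: {ce_if_first:.2f}")
print(f"CE if class 3 is correct: {ce_if_last:.2f}")
```

The probabilities do not change between the two cases; only the label does, and that alone moves the loss from small to large.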

Why not MSE for classification? MSE on probabilities caps the penalty for a confident wrong answer (the squared gap can never exceed 1), and once the output saturates, its gradient becomes tiny and flat when the model is very wrong, which is the place you most need a strong push. Cross-entropy's penalty grows without bound as p_correct approaches 0, and its gradient stays large exactly when the model is badly wrong, so it learns faster and more stably. This pairing, softmax + cross-entropy, is the workhorse loss of every LLM, where "classification" means "predict the next token out of the vocabulary."
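The flat-gradient claim can be made concrete with a simplified setup. The sketch below assumes a two-class problem with a single logit z passed through a sigmoid (a stand-in for softmax, chosen here so the derivatives fit on one line) and a true label of 1. It compares the gradient each loss sends back to the logit.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def grad_cross_entropy(z: float) -> float:
    """d/dz of -log(sigmoid(z)) with true label 1; simplifies to p - 1."""
    return sigmoid(z) - 1.0

def grad_mse(z: float) -> float:
    """d/dz of (sigmoid(z) - 1)**2 with true label 1; equals 2(p - 1) * p(1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - 1.0) * p * (1.0 - p)

# z = -5 means the model is confidently wrong (p is close to 0 but the label is 1)
for z in (-5.0, -2.0, 0.0):
    print(f"z = {z:+.1f}  p = {sigmoid(z):.4f}  "
          f"CE grad = {grad_cross_entropy(z):+.4f}  MSE grad = {grad_mse(z):+.4f}")
```

At z = -5 the cross-entropy gradient is close to -1 while the MSE gradient is close to 0, because the extra p(1 - p) factor vanishes when the sigmoid saturates.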

3.2 Gradient descent: which way is downhill

Now we have a number (loss) we want to minimize by adjusting parameters. Picture the loss as a landscape: the parameters are your coordinates, and the height is the loss. You're standing somewhere in the fog and want to reach the lowest valley. The strategy: feel the slope under your feet and step downhill.

The gradient, written ∇L ("grad L"), is the vector of partial derivatives of the loss with respect to every parameter. Each component answers: "if I increase this one parameter a hair, does the loss go up or down, and how steeply?" The gradient points in the direction of steepest increase, so to go downhill we step in the negative gradient direction. The update rule for a parameter θ ("theta") is:

\theta \leftarrow \theta - \eta \, \frac{\partial L}{\partial \theta}

✎ Gradient descent update — on real numbers

Every symbol: θ is one parameter (one of the millions of weights), ∂L/∂θ is the gradient — how much the loss changes if you nudge this parameter slightly — and η (eta) is the learning rate, controlling step size. The arrow ← means "replace the old value with the new value."

Concrete example: Say a weight θ = 5.0, the gradient ∂L/∂θ = 0.8 (positive, meaning that increasing this weight raises the loss), and learning rate η = 0.1. Then θ ← 5.0 - 0.1 * 0.8 = 5.0 - 0.08 = 4.92. The weight moved slightly downhill: since the gradient was positive (uphill), we subtract it to go downhill. If the gradient were negative (e.g. ∂L/∂θ = -2.0), then θ ← 5.0 - 0.1 * (-2.0) = 5.0 + 0.2 = 5.2. We'd increase the weight, because with a negative gradient increasing it is the downhill direction. Every parameter gets nudged: small steps, many times, in the direction that shrinks the loss.

That's the entire algorithm: subtract a scaled slope from every parameter, repeat.
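As a quick check of the update rule, the sketch below applies it to the two cases from the concrete example (θ = 5.0, η = 0.1, gradients 0.8 and -2.0). The function name `gd_update` is illustrative.

```python
def gd_update(theta: float, grad: float, lr: float) -> float:
    """One gradient descent step for a single parameter."""
    return theta - lr * grad

theta_pos = gd_update(5.0, grad=0.8, lr=0.1)    # positive gradient: weight decreases
theta_neg = gd_update(5.0, grad=-2.0, lr=0.1)   # negative gradient: weight increases

print(f"{theta_pos:.2f} {theta_neg:.2f}")
```

In both cases the parameter moves opposite to the sign of the gradient, which is the whole point of the minus sign in the rule.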

Worked micro-example. Fit ŷ = w·x + b to three points (1,2), (2,4), (3,6) (the true rule is y = 2x). Start at w = 0, b = 0, so every prediction is 0 and the errors (ŷ − y) are (−2, −4, −6). The MSE gradients are:

\frac{\partial L}{\partial w} = \frac{2}{n}\sum (\hat{y}_i - y_i)\,x_i, \qquad \frac{\partial L}{\partial b} = \frac{2}{n}\sum (\hat{y}_i - y_i)

Plugging in: ∂L/∂w = (2/3)[(−2)(1)+(−4)(2)+(−6)(3)] = (2/3)(−28) ≈ −18.7 and ∂L/∂b = (2/3)(−12) = −8. Both gradients are negative — meaning increasing w and b would lower the loss. With learning rate η = 0.1:

w ← 0 − 0.1·(−18.7) = 1.87 and b ← 0 − 0.1·(−8) = 0.8.

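One step moved w from 0 to 1.87, already close to the true value of 2, and cut the loss from about 18.7 to about 0.30. The bias overshot to 0.8 and takes longer to settle: repeat a few hundred times and w converges to 2 while b drifts back to 0. That's gradient descent.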
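The hand calculation can be reproduced, and then continued, with a short NumPy sketch. It uses the same three points, the same starting values, and η = 0.1; the 500-step count is an arbitrary choice for illustration.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])      # true rule: y = 2x
n = len(X)
lr = 0.1

def loss_and_grads(w, b):
    """MSE loss and its gradients for the line y_hat = w*x + b."""
    error = (w * X + b) - y
    loss = np.mean(error ** 2)
    grad_w = (2 / n) * np.sum(error * X)
    grad_b = (2 / n) * np.sum(error)
    return loss, grad_w, grad_b

# First step, matching the hand calculation
loss0, g_w0, g_b0 = loss_and_grads(0.0, 0.0)
w1 = 0.0 - lr * g_w0
b1 = 0.0 - lr * g_b0
loss1, _, _ = loss_and_grads(w1, b1)
print(f"gradients: dL/dw = {g_w0:.1f}, dL/db = {g_b0:.1f}")
print(f"after 1 step: w = {w1:.2f}, b = {b1:.2f}, loss {loss0:.2f} -> {loss1:.2f}")

# Keep going from there
w, b = w1, b1
for _ in range(499):
    _, g_w, g_b = loss_and_grads(w, b)
    w -= lr * g_w
    b -= lr * g_b
print(f"after 500 steps: w = {w:.3f}, b = {b:.3f}")
```

The loss collapses on the first step, while the last bit of error (the trade-off between w and b) takes many more steps to work out.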

3.3 The learning rate: the one knob you must feel

The learning rate η controls step size, and it's the single most consequential hyperparameter. Too small and training crawls — thousands of wasted steps inching toward the valley. Too large and you overshoot the bottom on every step, bouncing up the far wall and diverging (loss climbs to infinity or NaN). There's a Goldilocks band, and in practice people don't keep it fixed: they warm up (start tiny so early steps don't blow up) then decay it over training (smaller steps near the minimum for fine settling).

Play with it directly — drag the learning rate and watch the ball either glide into the valley, crawl, or fly off the surface:

[Interactive: "Gradient descent: feel the learning rate". A ball on a loss curve with a learning-rate slider. Initial state: step 0, x = -1.600, loss 3.809, ∇ = -1.656, learning rate 0.80. Status: descending, following the negative gradient downhill.]
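The three learning-rate regimes (crawl, converge, diverge) can also be seen without the widget. The sketch below runs gradient descent on L(x) = x², which is an assumed stand-in surface and not the curve used by the interactive, starting from x = -1.6.

```python
def descend(lr: float, steps: int = 20, x0: float = -1.6) -> float:
    """Run gradient descent on L(x) = x**2 (gradient 2x) and return the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x    # step against the gradient
    return x

for lr in (0.01, 0.4, 1.1):
    x = descend(lr)
    print(f"lr = {lr:<4}  final x = {x:+.4g}  final loss = {x * x:.4g}")
```

On this surface each step multiplies x by (1 - 2·lr), so a tiny rate barely shrinks it, a moderate rate shrinks it fast, and any rate above 1.0 makes its magnitude grow on every step.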

3.4 SGD, minibatches, and the vocabulary of a training run

The gradient formulas above sum over all n examples. With millions of examples, computing the full-dataset gradient for every single step is absurdly slow. The fix is Stochastic Gradient Descent (SGD): estimate the gradient from a small random minibatch of examples (say 32 or 256) instead of the whole dataset. Each estimate is noisy but unbiased, and you get to take many cheap steps instead of one expensive one.

The noise is a feature, not just a tax. The jitter from random batches helps the optimizer skip past shallow bad spots and tends to find flatter minima that generalize better — a mild, free regularization effect. (Regularization = anything that fights overfitting, i.e. memorizing the training data instead of learning the pattern.)

This gives you the three timing words that trip up beginners:

Term	Definition
Batch size	How many examples per gradient estimate (e.g. 256).
Step (iteration)	One forward+backward+update on one batch. One step = one parameter update.
Epoch	One full pass over the entire training set. `steps_per_epoch = dataset_size / batch_size`.

So "we trained for 3 epochs with batch size 256 on 1M examples" means 3 × (1,000,000 / 256) ≈ 11,700 parameter updates. LLM pretraining is usually described in steps or tokens, not epochs, because it sees each example roughly once.
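The bookkeeping between batch size, steps, and epochs can be sketched like this. The helper names are hypothetical, and rounding up with `math.ceil` is an assumption that the final, smaller batch is kept rather than dropped, which is why the count comes out slightly above the rounded figure in the text.

```python
import math
import numpy as np

def steps_per_epoch(dataset_size: int, batch_size: int) -> int:
    """Parameter updates in one pass over the data; the last batch may be smaller."""
    return math.ceil(dataset_size / batch_size)

def minibatches(dataset_size: int, batch_size: int, rng: np.random.Generator):
    """Yield index arrays that cover a shuffled dataset exactly once (one epoch)."""
    order = rng.permutation(dataset_size)
    for start in range(0, dataset_size, batch_size):
        yield order[start:start + batch_size]

total_steps = 3 * steps_per_epoch(1_000_000, 256)   # 3 epochs, batch size 256, 1M examples
print("total parameter updates:", total_steps)

rng = np.random.default_rng(0)
batches = list(minibatches(10, 4, rng))              # tiny dataset to show the batch shapes
print("batch sizes in one epoch:", [len(b) for b in batches])
```

Reshuffling at the start of each epoch is what makes each minibatch a fresh random sample rather than the same fixed groups.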

3.5 Backpropagation: getting every gradient cheaply

In the toy example we had two parameters and wrote their gradients by hand. A real network has millions to billions, arranged in layers where each layer's output feeds the next. How do you get the gradient for a weight buried five layers deep? The honest answer is the chain rule from calculus — but the insight of backpropagation is the bookkeeping that makes it cheap.

A network's prediction is a composition of functions: loss(layer_N(...layer_2(layer_1(x)))). The chain rule says the derivative of a composition is the product of the derivatives of each step. Backprop runs in two passes:

Forward pass: feed x through the layers to compute the prediction and the loss, caching each layer's intermediate output.
Backward pass: start from the loss and walk backward, multiplying local derivatives layer by layer, propagating the "error signal" from the output back to every parameter.

The key efficiency point, and the thing interviewers want, is that computing each parameter's gradient separately would re-derive the same shared sub-expressions over and over (one full pass per parameter, or worse if you expand every path through the graph). Backprop computes the gradient of all parameters in one backward pass, at a cost within a small constant factor of one forward pass, by computing each intermediate result once and reusing it everywhere it is needed (this is dynamic programming on the computation graph). That O(network size) cost, rather than O(network size × parameters), is why deep learning is computationally feasible at all. Modern frameworks call this autograd: they record the graph of operations during the forward pass and replay it backward automatically, so you never write a derivative by hand.
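To make the two passes concrete, here is a hand-written backward pass for a toy two-weight network, checked against a finite-difference estimate. The network (a tanh layer followed by a scalar weight, with squared-error loss) and all values are assumptions chosen for illustration.

```python
import math

def forward(w1, w2, x, y):
    """Forward pass: returns the loss plus the cached intermediates backprop reuses."""
    h = math.tanh(w1 * x)          # layer 1
    y_hat = w2 * h                 # layer 2
    loss = (y_hat - y) ** 2
    return loss, (h, y_hat)

def backward(w2, x, y, cache):
    """Backward pass: start at the loss and multiply local derivatives going back."""
    h, y_hat = cache
    d_y_hat = 2 * (y_hat - y)      # dL/dy_hat
    d_w2 = d_y_hat * h             # dL/dw2
    d_h = d_y_hat * w2             # dL/dh, computed once and reused upstream
    d_w1 = d_h * (1 - h ** 2) * x  # dL/dw1, using tanh'(u) = 1 - tanh(u)**2
    return d_w1, d_w2

w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
loss, cache = forward(w1, w2, x, y)
d_w1, d_w2 = backward(w2, x, y, cache)

# Numerical check: nudge each weight and measure how the loss changes
eps = 1e-6
num_w1 = (forward(w1 + eps, w2, x, y)[0] - forward(w1 - eps, w2, x, y)[0]) / (2 * eps)
num_w2 = (forward(w1, w2 + eps, x, y)[0] - forward(w1, w2 - eps, x, y)[0]) / (2 * eps)

print(f"backprop:    dL/dw1 = {d_w1:.5f}, dL/dw2 = {d_w2:.5f}")
print(f"finite diff: dL/dw1 = {num_w1:.5f}, dL/dw2 = {num_w2:.5f}")
```

The finite-difference check needs two extra forward passes per weight, which is the per-parameter cost that backprop avoids.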

3.6 Optimizers: SGD → momentum → Adam/AdamW

Plain SGD takes a step proportional to the raw gradient. It works, but it's twitchy: it zig-zags across narrow valleys and stalls on flat plateaus. Optimizers are smarter recipes for turning gradients into updates.

Optimizer	Idea (intuition)	When
SGD	Step directly down the (minibatch) gradient.	Simple, well-tuned vision models; strong baseline.
SGD + Momentum	Accumulate a running average of past gradients — a "velocity" — so consistent directions build speed and noise cancels out. Like a heavy ball rolling downhill.	Most CNN training.
Adam	Per-parameter adaptive steps: divide each parameter's step by a running estimate of its own gradient magnitude, so rarely-updated and wildly-scaled parameters all move sensibly. Momentum + auto-scaling in one.	The default for transformers/NLP.
AdamW	Adam with weight decay decoupled from the gradient (a cleaner way to pull weights toward zero for regularization).	The de-facto standard for training and fine-tuning LLMs.
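For readers who want to see the mechanics behind the table, the sketch below writes the three update rules in plain NumPy and runs them on a toy "narrow valley" loss. The loss surface, starting point, and hyperparameters are all assumptions for illustration, and the final numbers on a toy like this say nothing about which optimizer is better in practice.

```python
import numpy as np

def loss_fn(theta):
    """Narrow valley: gentle slope along theta[0], steep along theta[1]."""
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)

def grad_fn(theta):
    return np.array([theta[0], 25.0 * theta[1]])

def run_sgd(theta, lr, steps):
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)           # step straight down the gradient
    return theta

def run_momentum(theta, lr, steps, beta=0.9):
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        velocity = beta * velocity + grad_fn(theta)   # accumulate gradient history
        theta = theta - lr * velocity
    return theta

def run_adamw(theta, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    m = np.zeros_like(theta)   # running mean of gradients (the momentum part)
    v = np.zeros_like(theta)   # running mean of squared gradients (the scaling part)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)                  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * weight_decay * theta     # decoupled decay; 0.0 gives plain Adam
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

start = np.array([-4.0, 1.0])
results = {
    "sgd": run_sgd(start, lr=0.03, steps=100),
    "momentum": run_momentum(start, lr=0.03, steps=100),
    "adam": run_adamw(start, lr=0.03, steps=100),
}
print(f"initial   loss = {loss_fn(start):.5f}")
for name, theta in results.items():
    print(f"{name:9s} loss = {loss_fn(theta):.5f}")
```

Note where the weight decay sits in `run_adamw`: it shrinks the weights directly instead of being added to the gradient, which is the "decoupled" part of AdamW.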

You don't need the update equations memorized for IC3. You need the story: momentum smooths the path using gradient history; Adam additionally gives each parameter its own adaptive step size; AdamW is the LLM default. When someone says "we used AdamW with a cosine schedule and linear warmup," you now know that means: the adaptive-momentum optimizer, learning rate ramped up then smoothly decayed.
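"Linear warmup, then cosine decay" is easy to picture as a function from step number to learning rate. The sketch below is one common way to write it; the function name, the peak rate of 3e-4, and the step counts are illustrative assumptions, not values from any particular training run.

```python
import math

def lr_at(step: int, total_steps: int, warmup_steps: int,
          peak_lr: float, min_lr: float = 0.0) -> float:
    """Ramp linearly up to peak_lr over warmup_steps, then cosine-decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total, warmup, peak = 1000, 100, 3e-4
schedule = [lr_at(s, total, warmup, peak) for s in range(total)]

for s in (0, 50, 99, 100, 550, 999):
    print(f"step {s:4d}  lr = {schedule[s]:.2e}")
```

In a real loop the value returned for the current step would be written into the optimizer before each update; frameworks usually wrap this in a scheduler object.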

3.7 The canonical loop

Every framework, every fine-tune, every LLM pretrain is this same rhythm (four steps, then repeat) run until the loss flattens:

forward   →   loss   →   backward   →   step   →   repeat
(predict)   (score)    (get grads)   (update)

That's it. Everything else — architecture, data, schedules — is detail bolted onto this skeleton.

4. See it in code

First, the whole loop in pure NumPy so nothing is hidden — linear regression by gradient descent, exactly the math from §3.2:

import numpy as np
 
# tiny dataset; the true rule is y = 2x (the model has to discover it)
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
 
w, b = 0.0, 0.0      # parameters we will learn (start at zero)
lr = 0.01            # learning rate η
n = len(X)
 
for step in range(1000):
    y_hat = w * X + b                     # 1. FORWARD: predict
    error = y_hat - y                     #    residual (ŷ − y)
    loss  = np.mean(error ** 2)           # 2. LOSS: mean squared error
    grad_w = (2 / n) * np.sum(error * X)  # 3. BACKWARD: ∂L/∂w
    grad_b = (2 / n) * np.sum(error)      #            ∂L/∂b
    w -= lr * grad_w                      # 4. STEP: descend the gradient
    b -= lr * grad_b
    if step % 200 == 0:
        print(f"step {step:4d}  loss {loss:7.4f}  w {w:.3f}  b {b:.3f}")

Line by line: the forward pass computes predictions; error and loss score them; grad_w/grad_b are the hand-derived gradients (this is what backprop automates for big networks); the two -= lines are the step. Run it and the loss falls toward 0 while w → 2 and b → 0. Bump lr to 1.0 and watch it diverge to NaN — that's an oversized learning rate, live.

✎ NumPy training loop — section by section

Setup (everything above the for loop): Create tiny X and y (true rule is y = 2x), initialize parameters w and b to zero (any starting guess is allowed), set the learning rate to 0.01, and note n = 4 examples.

Loop body (inside the for loop):

y_hat = w * X + b: Forward pass. Multiply each x by weight w, add bias b, get predictions y_hat.
error = y_hat - y: Compute the residual, the gap between prediction and truth.
loss = np.mean(error ** 2): MSE loss, the average of squared errors; one number telling us "how wrong."
grad_w and grad_b: Backward pass with hand-computed gradients. ∂L/∂w sums error * X (how the weight should move); ∂L/∂b sums the errors (how the bias should move); both are scaled by 2/n. These are exactly the chain rule applied to MSE by hand.
w -= lr * grad_w and b -= lr * grad_b: Step. Subtract learning rate times gradient from each parameter. Both w and b move downhill.
if step % 200 == 0: Print progress every 200 steps.

The whole loop is: measure wrongness, compute slopes, nudge the numbers, repeat. Thousands of tiny steps land you at the true rule.

Now the same loop in PyTorch, where autograd computes the gradients for you:

import torch
import torch.nn as nn
 
X = torch.tensor([[1.], [2.], [3.], [4.]])
y = torch.tensor([[2.], [4.], [6.], [8.]])
 
model   = nn.Linear(1, 1)                              # ŷ = wx + b
opt     = torch.optim.SGD(model.parameters(), lr=0.01) # optimizer holds the params
loss_fn = nn.MSELoss()
 
for step in range(1000):
    opt.zero_grad()              # clear last step's gradients (they accumulate!)
    loss = loss_fn(model(X), y)  # FORWARD + LOSS in one line
    loss.backward()              # BACKWARD: autograd fills every param's .grad
    opt.step()                   # STEP: optimizer updates params using .grad

The structure is identical — zero_grad → forward → loss → backward → step — but loss.backward() walks the computation graph and computes ∂loss/∂θ for every parameter automatically, and opt.step() applies the update rule. Swap SGD for torch.optim.AdamW and you've upgraded optimizers in one token. One subtlety worth internalizing: PyTorch accumulates gradients across backward() calls, so you must zero_grad() each step or your gradients pile up and training breaks — a classic first-day bug. The full runnable version, with both implementations and a divergence demo, lives in examples/ml/train_loop.py.

✎ PyTorch training loop — section by section

Setup (everything above the for loop): Create X and y as tensors, build a linear model (one weight and one bias), pick the SGD optimizer (which holds references to the model's parameters), and choose MSE loss.

Loop body (inside the for loop):

opt.zero_grad(): Erase the gradients accumulated in the previous step. PyTorch adds new gradients to whatever is already there; forgetting this makes them pile up and wrecks training.
loss = loss_fn(model(X), y): Forward pass and loss in one line. Feed X through the model to get predictions, then compute MSE loss against y. One scalar result.
loss.backward(): Autograd walks the computation graph backward from the loss and fills every parameter's .grad attribute with ∂loss/∂θ. This is backpropagation, automated. The gradients we derived by hand for NumPy are now computed for us, and the same call works unchanged for a model with millions of parameters.
opt.step(): The optimizer reads .grad for each parameter and applies the update rule (for SGD: θ ← θ - η * grad).

The dance is identical to NumPy, but PyTorch auto-computes the slopes and applies the update, letting you swap optimizers (SGD → AdamW) in one word.

Live illustration: Convergence vs overfitting

Training loss keeps falling; validation loss falls then rises as the model memorizes. The gap is overfitting — you stop at the validation minimum (early stopping).
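Early stopping itself is only a few lines of logic. The sketch below picks the validation minimum from a list of per-epoch validation losses; the loss values are made-up numbers for illustration, and `patience` (how many non-improving epochs to tolerate) is an assumed parameter.

```python
def best_epoch(val_losses, patience=2):
    """Return (epoch, loss) at the validation minimum, giving up once the loss
    has failed to improve for `patience` epochs in a row."""
    best_idx, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = epoch, loss   # new best: remember it
        elif epoch - best_idx >= patience:
            break                               # no improvement for too long: stop
    return best_idx, best_loss

# Hypothetical validation losses: falling, then rising as the model overfits
val_losses = [0.90, 0.62, 0.48, 0.41, 0.43, 0.47, 0.55]
stop_epoch, stop_loss = best_epoch(val_losses)
print(f"keep the checkpoint from epoch {stop_epoch} (val loss {stop_loss})")
```

In practice you would also save the model weights each time a new best is found, so the checkpoint at the minimum is the one you ship.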

5. Mental models & SWE analogies

Gradient ≈ the compiler error that tells you which direction to fix. It doesn't hand you the answer; it tells you, for every knob, which way reduces the problem and by how much.
Training loop ≈ gradient-based hill climbing / a feedback controller. Measure error, compute correction, apply a fraction of it (the learning rate is your control gain), repeat. Too much gain and the system oscillates and blows up.
Backprop ≈ memoized recursion / dynamic programming. The chain rule is the recurrence; backprop is the memoization that computes all derivatives in one pass instead of recomputing shared sub-expressions exponentially.
Learning rate ≈ a retry/backoff or simulated-annealing temperature. Big early, small late: explore boldly, then settle precisely.
Loss vs. eval metric ≈ the proxy you can differentiate vs. the KPI you actually ship. You optimize cross-entropy because it's smooth and differentiable; you report accuracy/F1/win-rate because that's what users feel. Conflating them is a real bug (/evals).

6. Common confusions

"The loss is the metric I care about." No — loss is a smooth, differentiable proxy chosen so gradients exist. Accuracy, BLEU, and win-rate are what you actually want; they're often non-differentiable, so you can't descend them directly.
"Logits are probabilities." Logits are raw unbounded scores; softmax turns them into probabilities. A logit of 8 isn't "80%."
"Backprop is a different algorithm from the chain rule." It is the chain rule — applied with smart caching so the whole gradient costs one backward pass. The cleverness is the bookkeeping, not new calculus.
"Bigger learning rate = faster learning." Up to a point. Past it, you overshoot every step and the loss diverges to NaN. Faster only until it's catastrophic.
"Epoch = step." An epoch is one full pass over the data; a step is one update on one minibatch. One epoch is usually thousands of steps.
"SGD uses one example at a time." In modern practice "SGD" means minibatch SGD — a small batch per step, not a single example.
"Adam is always better than SGD." Adam converges faster and is the default for transformers, but well-tuned SGD+momentum often generalizes as well or better in vision. "Default," not "universally optimal."
"More epochs always help." Past a point the model starts memorizing the training set (overfitting) and held-out performance degrades. You watch validation loss, not training loss.

7. Check yourself

[Prereq] What is a loss function, and why cross-entropy (not MSE) for classification? A loss function scores how wrong a prediction is as one number you minimize. MSE measures squared distance (for continuous targets); cross-entropy measures "surprise," −log of the probability assigned to the true class. Cross-entropy keeps a strong gradient exactly when the model is confidently wrong — where MSE's gradient goes flat — so classifiers (including LLMs predicting the next token) learn faster and more stably.

[Prereq] What are the four steps of one training iteration? Forward (predict), compute loss (score), backward (get gradients via backprop), step (update each parameter against its gradient). Repeat until the loss flattens.

[IC3] Explain gradient descent and the learning rate's tradeoff. The gradient points uphill on the loss surface; we step in the negative gradient direction: θ ← θ − η·∂L/∂θ. The learning rate η is step size. Too small → painfully slow convergence; too large → overshoot the minimum and diverge (loss → NaN). In practice people warm up then decay it.

[IC3] What do batch size, step, and epoch mean, and why minibatches? Batch size = examples per gradient estimate; a step = one update on one batch; an epoch = one full pass over the data (dataset_size / batch_size steps). Minibatches make each step cheap (so you take many) and inject useful gradient noise that regularizes and helps escape poor minima.

[IC4] What does backprop compute, and how do SGD, momentum, and Adam differ? Backprop computes ∂loss/∂θ for every parameter in a single backward pass by applying the chain rule with cached intermediates — O(network size), not O(size × params). SGD steps down the raw minibatch gradient; momentum adds a velocity (running average of gradients) to smooth and accelerate; Adam additionally scales each parameter's step by a running estimate of its own gradient magnitude (adaptive per-parameter rates), and AdamW decouples weight decay — the LLM default.

You're ready to move on when you can narrate the forward → loss → backward → step loop unprompted, say why cross-entropy beats MSE for classification, and explain what the learning rate trades off — without reaching for notes.

8. Go deeper

Stanford CS229 — Machine Learning (cs229.stanford.edu): the canonical derivation of gradient descent, MSE, and logistic regression from first principles.
Stanford CS231n — Optimization & Backprop notes (optimization-1, optimization-2): the clearest written treatment of SGD and backprop as a computation graph.
3Blue1Brown — Neural Networks series (3blue1brown.com): the visual intuition for gradient descent and backpropagation; watch before or after this lesson.
Dive into Deep Learning — Linear Regression & Optimization (d2l.ai): runnable notebooks of exactly the loop in §4.
Adam (arxiv 1412.6980) and AdamW (arxiv 1711.05101): the two optimizer papers you'll see cited everywhere — skim the intros now that you have the intuition.

Next: Neural Networks. Stack these trainable layers into a network and see what the forward pass and backprop actually run over, then move on to Fine-tuning, which is this loop applied to a pretrained model.
