ML Foundations (for engineers)

NLP and the Road to LLMs

An LLM is autocomplete trained on the internet — text becomes numbered tokens, meaning becomes geometry, attention lets every token see every other, and the whole thing learns by predicting the next token. This is the vocabulary every agents/RAG conversation assumes.

12 min read · 15 sections
Prerequisites: neural networks (forward pass, softmax), how models learn (loss, gradients)

1. The one-sentence intuition

A large language model (LLM) is autocomplete trained on a huge slice of the internet. Under the hood it does exactly one thing: read a sequence of tokens (numbered chunks of text) and predict the next one — then append it and repeat. Translation, code, chat, "reasoning" are all that single trick, scaled up. The SWE mental model: it's your IDE's autocomplete, except the "index" isn't a symbol table — it's a few hundred billion floating-point numbers tuned by gradient descent so that the most probable next token is usually a good one.

2. Why a software engineer needs this

This lesson is the bridge from "neural nets" to everything in the Agents and RAG pillars. Four concrete payoffs:

  • Tokens are the unit you pay for and reason about. Pricing, latency, and the context window (how much text fits) are all measured in tokens — not characters or words. "Why did my 50-page PDF blow the budget?" is a token question.
  • Embeddings are the entire foundation of RAG. Vector search, chunking, rerankers — all of it rests on "meaning as geometry," which starts here.
  • Next-token prediction explains the model's personality. Hallucination, confident wrongness, and why a knob called temperature exists all fall out of "it's sampling from a probability distribution," not "it's looking up facts."
  • Pretraining vs. post-training tells you what fine-tuning actually changes. When someone says "we did SFT then RLHF," you'll know which behaviors that touches.

Interviews silently assume you can answer: what is a token, what is an embedding, what did attention buy us over RNNs, and what do softmax/temperature do? Miss these and you sound like you've only used the API, never understood it.

3. Build it up from scratch

The words first.

  • Token — a chunk of text the model reads and writes. It can be a whole word, part of a word, or punctuation. The model never sees raw letters, only tokens.
  • Vocabulary — the fixed list of every token the model knows (often ~50,000–100,000 of them). Every output must be one item from this list.
  • Logit — the raw score the model assigns to each token in the vocabulary for "what comes next." Higher means more favored. Logits are unbounded and can be negative; they are not yet probabilities.
  • Softmax — a formula that squashes the whole list of logits into probabilities: all positive, all between 0 and 1, summing to exactly 1.
  • Sampling — actually picking one token using those probabilities, like rolling weighted dice where likelier tokens have bigger faces.
  • Temperature — a knob applied before softmax. Low (0.2) sharpens the gap so the top token dominates (predictable); high (1.5) flattens it (random, creative).
  • Top-p (nucleus sampling) — keep only the smallest set of top tokens whose probabilities add up to p (say 0.9), throw away the unlikely tail, then sample from just those.

Step by step.

  1. Split the input text into tokens and feed them in.
  2. The model emits one logit for every token in the vocabulary — a long list of raw scores.
  3. Divide each logit by the temperature to sharpen or flatten.
  4. Run softmax to turn those scores into probabilities that sum to 1.
  5. Apply top-p: drop the tail, keep the top group, rescale so they sum to 1 again.
  6. Sample one token from what remains, append it, and repeat for the next token.

Remember this: the model never "knows" the next word — it scores every possible token, converts those scores to probabilities, and rolls weighted dice you can tune.
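
The six steps above can be sketched as one small function. This is a minimal illustration, not how any particular inference library implements it: the function name sample_next_token, the toy vocabulary, and the toy logits are all assumptions made for the example.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Return (token_index, filtered_probs) following steps 3-6 above."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature    # step 3: apply temperature
    z = z - z.max()                                       # stability shift; probabilities unchanged
    probs = np.exp(z) / np.exp(z).sum()                   # step 4: softmax
    order = np.argsort(probs)[::-1]                       # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    # step 5: keep the smallest prefix whose cumulative probability reaches top_p
    keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()                            # rescale survivors to sum to 1
    token = int(rng.choice(len(probs), p=filtered))       # step 6: roll the weighted dice
    return token, filtered

vocab = ["the", "cat", "sat", "dog", "ran"]               # toy vocabulary (assumption)
logits = [2.0, 1.0, 0.1, 1.5, -0.5]                       # toy logits (assumption)
idx, filtered = sample_next_token(logits, top_p=0.9, rng=np.random.default_rng(0))
print(vocab[idx], filtered.round(3))
```

With these numbers the top three tokens only reach about 0.895 of the probability mass, so top-p = 0.9 keeps four tokens and drops just "ran" before rescaling.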

3.1 Step one: text → numbers (tokenization)

Neural nets eat vectors of numbers, not strings. So the first job is to chop text into pieces and map each piece to an integer. A token is one such piece; the fixed list of all possible pieces is the vocabulary; a token's index in that list is its token id. Tokenization is the lookup "hello" → 15339.

Why not just split on words? The vocabulary would be effectively infinite (every typo, name, snake_case identifier, and emoji is a new word), and any word you didn't see in training becomes an unknown blank — useless for code and proper nouns.

Why not characters? The vocabulary is tiny and there are no unknowns, but sequences get ~5× longer (every letter is a step), and a lone t carries almost no meaning, so the model wastes capacity relearning that t-h-e spells "the."

The winning compromise is subword tokenization, usually via Byte-Pair Encoding (BPE). Idea: start from individual bytes/characters, then repeatedly find the most frequent adjacent pair and merge it into a new token. Common chunks ("the", "ing", "tion") become single tokens; rare words gracefully fall back to smaller pieces.

Worked micro-example. Train on the words low, lower, newest, widest. Start as characters:

l o w   ·   l o w e r   ·   n e w e s t   ·   w i d e s t

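With each word counted once, several adjacent pairs are tied at two occurrences; e s is one of the tied leaders, appearing in both newest and widest. Merge it → es. Now es t also appears in both words, so merge it → est. After a few merges, est is a single reusable token, so a brand-new word like slowest tokenizes into a few known pieces, such as s + low + est (or slow + est once a larger corpus has taught the tokenizer slow), instead of seven characters. No "unknown" token is needed.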
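
To make the merge loop concrete, here is a minimal BPE trainer sketch. It is not a production tokenizer: the helper names are made up for this example, and the word counts are an assumption added so that e s is the unambiguous first merge (with every word counted once, several pairs tie).

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_counts, num_merges):
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # on ties, the first pair seen wins
        merges.append(best)
        corpus = {merge(symbols, best): count for symbols, count in corpus.items()}
    return merges

def tokenize(word, merges):
    symbols = tuple(word)
    for pair in merges:                       # replay merges in the order they were learned
        symbols = merge(symbols, pair)
    return list(symbols)

word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}   # counts are assumed
merges = train_bpe(word_counts, num_merges=4)
print(merges)                        # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(tokenize("slowest", merges))   # ['s', 'low', 'est']
```

With only four merges learned from four words, slowest comes out as s + low + est; a realistic corpus would likely learn a merge for slow as well.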

Practical facts worth carrying:

  • Real vocabularies are ~30k–130k tokens (GPT-2 used 50,257). Modern LLMs use byte-level BPE, so any string — Unicode, source code, gibberish — is representable; nothing is ever truly out-of-vocabulary.
  • A rule of thumb for English: 1 token ≈ 4 characters ≈ ¾ of a word. "tokenization" might split as token + ization; "Anthropic" as Anthrop + ic. Whitespace usually attaches to the following token.
  • Tokenization is mechanical and fixed before training — no learning yet. Learning starts once these ids become vectors.
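
The 4-characters-per-token rule of thumb is enough for a back-of-the-envelope budget check. The sketch below is only an estimate for English prose; the function names, the 3,000 characters per page, and the 32,000-token window are assumptions for illustration. For real billing or limits, count with the model's actual tokenizer.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough English-only estimate from the '1 token is about 4 characters' rule."""
    return max(1, round(len(text) / chars_per_token))

def fits_in_context(text, context_window, reserved_for_output=0):
    """True if the estimated prompt plus reserved output tokens fit in the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window

page = "x" * 3000                  # assumed: about 3,000 characters per page
document = page * 50               # a 50-page document
print(estimate_tokens(document))   # 37500
print(fits_in_context(document, context_window=32_000, reserved_for_output=1_000))  # False
```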

3.2 Meaning as geometry (word embeddings)

A token id like 15339 is just a label; 15340 isn't "one more" than it. We want each token to become a vector where distance and direction encode meaning. That vector is an embedding: a list of, say, 768 learned numbers per token. The model keeps an embedding matrix of shape [vocab_size, d] and "looks up" each token's row — a learnable lookup table.
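
A minimal sketch of that lookup, with toy sizes and random numbers standing in for learned weights (the ids and dimensions are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4                                   # toy sizes; real models are far larger
embedding_matrix = rng.normal(size=(vocab_size, d))     # random stand-in for learned weights

token_ids = np.array([3, 7, 3])                         # hypothetical ids for a 3-token input
vectors = embedding_matrix[token_ids]                   # row lookup, one vector per token

# The lookup is equivalent to multiplying one-hot vectors by the matrix,
# which is why gradients can flow into the selected rows during training.
one_hot = np.eye(vocab_size)[token_ids]
print(np.allclose(one_hot @ embedding_matrix, vectors))  # True
print(vectors.shape)                                     # (3, 4)
```

Note that the same id (3) always returns the same row: at this stage the vector does not yet depend on context.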

The classic result (word2vec, 2013; GloVe from Stanford, 2014) is that if you train these vectors so words appearing in similar contexts get similar vectors, geometry starts to mirror semantics. Similarity is measured by cosine similarity — the angle between two vectors:

\[
\cos(a, b) = \frac{a \cdot b}{\lVert a\rVert \, \lVert b\rVert}
\]

Here a · b is the dot product (multiply componentwise, sum), and ‖a‖ is the vector's length. The result runs from −1 (opposite) to 1 (identical direction); "cat" and "dog" land close, "cat" and "thermodynamics" far apart. Even directions carry meaning, giving the famous analogy:

\[
\text{vec}(\text{king}) - \text{vec}(\text{man}) + \text{vec}(\text{woman}) \approx \text{vec}(\text{queen})
\]

The "royalty" and "gender" concepts became roughly consistent directions in the space. This is exactly the trick RAG reuses: embed a query and your documents, then retrieve by cosine similarity.
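
Here is a small sketch of cosine similarity and the analogy arithmetic. The vectors are hand-made 3-d toys with axes [royalty, gender, animal], chosen so the analogy works exactly; real embeddings are learned, have hundreds of dimensions, and only satisfy such analogies approximately.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (assumption): axes are [royalty, gender, animal].
vec = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "cat":   np.array([0.0,  0.0, 1.0]),
    "dog":   np.array([0.0,  0.1, 0.9]),
}

target = vec["king"] - vec["man"] + vec["woman"]
candidates = [w for w in vec if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(vec[w], target))
print(best)                                        # queen
print(round(cosine(vec["cat"], vec["dog"]), 3))    # 0.994
```

Retrieval in RAG is the same max-by-cosine step, with the query embedding as the target and document chunks as the candidates.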

The fatal limitation: word2vec/GloVe give one vector per word, forever. "bank" gets a single embedding that must average river bank and savings bank. There's no way to bend the vector based on the surrounding sentence. Real language is contextual — and fixing that is the rest of this story.

3.3 Reading in order (RNNs) and their bottleneck

To use context you must process a sequence. The first deep approach was the Recurrent Neural Network (RNN): walk left to right, maintaining a hidden state h (a vector summarizing everything seen so far). At each token: h_new = f(h_old, embedding(token)). The LSTM (1997) is a fancier RNN with gates that decide what to keep or forget, which helped it remember longer.
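
A minimal sketch of that recurrence, assuming the simplest choice of f: a tanh over two weight matrices (the vanilla RNN form). The text above only says f; the weights here are random rather than trained, and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_hidden = 4, 3                                  # toy sizes (assumption)
W_h = rng.normal(scale=0.5, size=(d_hidden, d_hidden))    # hidden -> hidden weights
W_x = rng.normal(scale=0.5, size=(d_hidden, d_embed))     # input  -> hidden weights
b = np.zeros(d_hidden)

def rnn_step(h_old, x):
    """h_new = f(h_old, embedding(token)), with f = tanh of a linear map."""
    return np.tanh(W_h @ h_old + W_x @ x + b)

embeddings = rng.normal(size=(6, d_embed))   # stand-in embeddings for a 6-token input
h = np.zeros(d_hidden)
for x in embeddings:                         # each step needs the previous h: no parallelism
    h = rnn_step(h, x)

print(h.shape)   # (3,)
```

However long the input is, everything the model remembers must fit in that one fixed-size h.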

RNNs read context, but they have three painful bottlenecks:

  1. A single fixed-size hidden state must compress the entire history. Early tokens get blurred out by the time you reach token 500 — the model literally forgets the start of a long paragraph.
  2. Inherently sequential. Token 500 can't be computed until tokens 1–499 are done, so you can't parallelize across the sequence — brutal on GPUs.
  3. Vanishing gradients over long distances make learning relationships between far-apart words hard (covered in how models learn).

3.4 Attention: let every token look at every other token

The fix (Bahdanau 2015, then generalized) is attention: instead of cramming the past into one hidden state, let each token directly look at every other token and pull in what's relevant. For the word "it" in "the cat sat because it was tired," attention lets "it" reach back and weight "cat" heavily.

Mechanically, every token produces three vectors:

  • a query q — "what am I looking for?"
  • a key k — "what do I offer?"
  • a value v — "what I'll actually contribute if you attend to me."

A token compares its query to every other token's key via dot product (high dot product = relevant), softmaxes those scores into weights that sum to 1, and takes the weighted average of the corresponding values. In one line:

\[
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V
\]

Attention on real numbers

Every symbol in plain words: Q is the stacked queries from all tokens (what each token is looking for). K is the keys (what each token offers as a match). V is the values (the actual data each token contributes if matched). The term QK^T is the all-pairs dot product — token i's query dotted with every token's key, producing one score per candidate. The √d rescales to keep those scores from getting too large (d is the vector size, like 768). Softmax then converts each row of scores into weights that sum to 1. Finally, multiply by V to take a weighted average of the values.

Concrete numeric walk: Suppose we have 3 tokens, d = 2 (tiny for learning), and:

  • Q = [[1.0, 0.5], [0.2, 1.5], [0.8, 0.1]] (query vectors for tokens 1, 2, 3)
  • K = [[1.0, 0.0], [0.5, 1.0], [2.0, 0.5]] (key vectors)
  • V = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]] (value vectors)

First, QK^T: token 1's query [1.0, 0.5] dotted with the first key [1.0, 0.0] gives 1.0*1.0 + 0.5*0.0 = 1.0. Against all three keys, token 1 scores [1.0, 1.0, 2.25]. Do this for all 3 queries (all 9 dot products):

QK^T = [[1.0,  1.0, 2.25],
        [0.2,  1.6, 1.15],
        [0.8,  0.5, 1.65]]

Divide by √2 ≈ 1.414: QK^T / √2 ≈ [[0.707, 0.707, 1.591], ...]. Softmax each row independently. Row 1: exponentiate to [e^0.707, e^0.707, e^1.591] ≈ [2.028, 2.028, 4.909], sum ≈ 8.965, normalize: [0.226, 0.226, 0.548]. So token 1 attends most heavily to token 3 (weight 0.548), and equally to tokens 1 and 2.

Finally, token 1's output is 0.226 * V[0] + 0.226 * V[1] + 0.548 * V[2]:

0.226 * [0.1, 0.9] + 0.226 * [0.8, 0.2] + 0.548 * [0.5, 0.5]
≈ [0.0226, 0.2034] + [0.1808, 0.0452] + [0.274, 0.274]
≈ [0.477, 0.523]

That weighted average is token 1's "attended" output — a blend of all values, weighted by how relevant each token's key was to its query.

What just happened: each token looked at every other token, computed relevance scores, turned them into a probability distribution, and pulled a weighted average of values. This is how "it" reaches back to "cat" — the word "it" gets a high-weight match to "cat"'s key and pulls in "cat"'s features.

Q, K, V are the stacked query/key/value vectors for all tokens; QKᵀ is the all-pairs relevance grid; dividing by √d (d = vector size) keeps the numbers from exploding; softmax turns each row into weights; multiplying by V is the weighted average. The key win: there's no fixed bottleneck and no sequential dependency — every pair interacts directly, and it all runs as parallel matrix multiplies. Now "bank" can finally become context-dependent: its output vector bends toward "river" or "money" based on its neighbors.
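
The formula fits in a few lines of numpy. The sketch below is single-head attention with no masking and no learned projections (Q, K, V are given directly, using the toy matrices from the numeric walk), so it illustrates the formula only, not a full Transformer layer. Running it is a useful check on any hand arithmetic.

```python
import numpy as np

def softmax_rows(x):
    """Softmax applied independently to each row."""
    x = x - x.max(axis=-1, keepdims=True)      # stability shift; result unchanged
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V; returns (outputs, weights)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # all-pairs relevance grid
    weights = softmax_rows(scores)             # each row sums to 1
    return weights @ V, weights                # weighted average of the values

Q = np.array([[1.0, 0.5], [0.2, 1.5], [0.8, 0.1]])
K = np.array([[1.0, 0.0], [0.5, 1.0], [2.0, 0.5]])
V = np.array([[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]])

output, weights = attention(Q, K, V)
print((Q @ K.T).round(2))    # raw scores, one row per query token
print(weights.round(3))      # attention weights, one row per query token
print(output.round(3))       # attended output vectors
```

Everything is a matrix multiply or an elementwise operation, which is why all tokens can be processed in parallel.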

3.5 The Transformer (a teaser)

The Transformer (Vaswani et al., 2017, "Attention Is All You Need") threw out recurrence entirely and built the network from stacked blocks of attention + a small feed-forward network, with position information added back in (since attention alone is order-blind). Because it's parallel and scales beautifully on GPUs, it's the architecture behind every LLM today. The internals — multi-head attention, positional encodings, residual streams, layer norm — are the entire Transformers pillar. For now, hold this: a Transformer is a deep stack of "every token attends to every token, then thinks for a moment."
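
The "order-blind" claim can be checked directly. In this sketch, shuffling the input tokens just shuffles the outputs the same way, so attention on its own cannot tell "dog bites man" from "man bites dog". As a simplifying assumption, queries, keys, and values are the raw input vectors, with no learned projections and no positional information.

```python
import numpy as np

def self_attention(X):
    """Bare self-attention where Q = K = V = X (a simplifying assumption)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))            # 4 token vectors of size 8 (toy values)
perm = np.array([2, 0, 3, 1])          # an arbitrary reordering of the tokens

out = self_attention(X)
out_shuffled = self_attention(X[perm])
print(np.allclose(out_shuffled, out[perm]))   # True: same vectors, just reordered
```

Each token's output is identical regardless of where it sits in the sequence, which is why position information has to be added back in.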

3.6 What a language model actually does

Put it together. Feed in a sequence of tokens; the Transformer outputs, for the next position, one raw score per vocabulary token. Those raw scores are logits — unnormalized "how likely is this the next token?" numbers, ranging over all reals. To turn ~100k logits into a probability distribution, apply softmax:

\[
p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}
\]

Softmax with temperature on real numbers

Every symbol: z_i is the logit (raw score) for token i. e is exp (the exponential function). T is temperature. The numerator e^(z_i / T) exponentiates each logit. The denominator sums all exponentiated logits so they're normalized (sum to 1).

Concrete example: suppose the model produced logits [2.0, 1.0, 0.1, 1.5, -0.5] for vocab ["the", "cat", "sat", "dog", "ran"].

At T = 1 (normal softmax): Exponentiate: [e^2.0, e^1.0, e^0.1, e^1.5, e^-0.5] ≈ [7.39, 2.72, 1.11, 4.48, 0.61]. Sum ≈ 16.3. Divide each by 16.3:

[7.39/16.3, 2.72/16.3, 1.11/16.3, 4.48/16.3, 0.61/16.3]
≈ [0.453, 0.167, 0.068, 0.275, 0.037]

So "the" wins with 45.3%, "dog" second with 27.5%.

At T = 0.5 (cold, sharpened): Divide logits first: [2.0/0.5, 1.0/0.5, ...] = [4.0, 2.0, 0.2, 3.0, -1.0]. Exponentiate: [e^4.0, e^2.0, e^0.2, e^3.0, e^-1.0] ≈ [54.6, 7.39, 1.22, 20.1, 0.368]. Sum ≈ 83.7. Normalize: [0.653, 0.088, 0.015, 0.240, 0.004]. Now "the" dominates with 65%, and tokens with small logits are nearly impossible. Temperature sharpened the distribution.

At T = 2.0 (hot, flattened): Divide logits: [1.0, 0.5, 0.05, 0.75, -0.25]. Exponentiate: [e^1.0, e^0.5, ...] ≈ [2.72, 1.65, 1.05, 2.12, 0.78]. Sum ≈ 8.32. Normalize: [0.327, 0.198, 0.126, 0.255, 0.094]. Now the probabilities are much more uniform: "sat" and "ran" (previously tiny) have real weight, because high temperature blurs the differences.

What just happened: the temperature rescaled the logits before exponentiation, which amplifies or blunts their differences before normalization. Low T makes the best choice ever more dominant (deterministic, safe). High T flattens all choices toward uniform (random, creative).

z_i is the logit for token i, e is exp (makes everything positive and amplifies gaps), the denominator normalizes so all p_i sum to 1, and T is temperature (T = 1 leaves the logits unchanged). You now have P(next token | everything so far). You pick a token, append it, and run again: this is autoregressive generation.
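
The "pick, append, run again" loop looks like this in code. Since there is no real model here, a tiny hand-written table stands in for the Transformer: BIGRAM_LOGITS and next_token_logits are hypothetical, and they look only at the last token, whereas a real LLM conditions on the whole sequence.

```python
import numpy as np

vocab = ["the", "cat", "sat", "dog", "ran"]

# Hypothetical stand-in for a trained model: row i holds the logits after token i.
BIGRAM_LOGITS = np.array([
    [-2.0,  2.0, -1.0,  1.5, -1.0],   # after "the"
    [-1.0, -2.0,  2.0, -2.0,  1.0],   # after "cat"
    [ 2.0, -1.0, -2.0, -1.0, -1.0],   # after "sat"
    [-1.0, -2.0,  1.0, -2.0,  2.0],   # after "dog"
    [ 2.0, -1.0, -1.0, -1.0, -2.0],   # after "ran"
])

def next_token_logits(token_ids):
    return BIGRAM_LOGITS[token_ids[-1]]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(prompt_ids, steps, rng=None):
    """Autoregressive loop: predict, pick, append, repeat. Greedy when rng is None."""
    ids = list(prompt_ids)
    for _ in range(steps):
        probs = softmax(next_token_logits(ids))
        if rng is None:
            nxt = int(np.argmax(probs))                    # greedy decoding
        else:
            nxt = int(rng.choice(len(probs), p=probs))     # sampling
        ids.append(nxt)
    return ids

greedy = generate([0], steps=5)
print(" ".join(vocab[i] for i in greedy))    # the cat sat the cat sat

sampled = generate([0], steps=5, rng=np.random.default_rng(0))
print(" ".join(vocab[i] for i in sampled))   # varies with the seed
```

The greedy run also shows the looping behavior that deterministic decoding is prone to.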

The genius is the training signal. The "label" for any position is simply the token that actually came next in the real text — which is free, sitting right there in the corpus. No human annotation. This is self-supervised learning: the data labels itself. Run it over trillions of tokens and minimizing next-token cross-entropy loss forces the model to absorb grammar, facts, code, and reasoning patterns, because predicting the next token well requires understanding the previous ones.
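
To see how the data labels itself, here is a small sketch of building (input, target) pairs by shifting a token sequence by one, then scoring predictions with cross-entropy. The token ids and the two hand-made logit matrices are assumptions; a real model would produce the logits.

```python
import numpy as np

token_ids = np.array([0, 1, 2, 0, 3])                 # hypothetical ids for a 5-token text
inputs, targets = token_ids[:-1], token_ids[1:]       # the label is simply the next token

def cross_entropy(logits, targets):
    """Mean of -log p(correct next token) over all positions."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))   # log-softmax
    return float(-log_probs[np.arange(len(targets)), targets].mean())

vocab_size = 5

# A model that knows nothing: equal logits, so the loss is ln(vocab_size).
uniform_logits = np.zeros((len(inputs), vocab_size))
print(round(cross_entropy(uniform_logits, targets), 3))   # 1.609

# A model that is confidently right at every position: loss near zero.
confident = np.full((len(inputs), vocab_size), -5.0)
confident[np.arange(len(targets)), targets] = 5.0
print(cross_entropy(confident, targets) < 0.01)           # True
```

Training nudges the weights so the loss moves from the first case toward the second across the whole corpus.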

Decoding — turning probabilities into text:

| Strategy | What it does | Effect |
| --- | --- | --- |
| Greedy / argmax | always take the highest-probability token | deterministic, can loop or feel flat |
| Sampling | draw a token in proportion to its probability | varied, "creative," occasionally wrong |
| Temperature T | rescale logits before softmax | T<1 sharpens (safer), T>1 flattens (wilder), T→0 ≈ greedy |
| Top-p (nucleus) | sample only from the smallest set of tokens whose probs sum to p (e.g. 0.9) | cuts the long tail of nonsense while keeping variety |

Temperature and top-p are the two knobs you'll actually turn in production. High temperature for brainstorming; low (or T=0) for extraction, classification, and anything you'll parse.

3.7 Pretraining vs. post-training, in one paragraph

What we just described — self-supervised next-token prediction on internet-scale text — is pretraining. It produces a base model: a brilliant autocomplete that will happily continue your prompt but doesn't reliably follow instructions or stay helpful/harmless. Post-training fixes that. Supervised fine-tuning (SFT) continues training on curated (instruction, ideal response) pairs so the model learns the assistant format. Then RLHF (Reinforcement Learning from Human Feedback) trains a reward model on human preference comparisons ("response A is better than B") and nudges the LLM toward higher-rewarded outputs. Same next-token machinery, just pointed at behavior instead of raw text. That's the whole job of the Fine-tuning pillar.

Tokenizer demo (text → tokens → ids): real tokenizers learn subword merges (BPE), so common words are one token and rare words split into pieces. That is why token counts differ from word counts, and why spelling and arithmetic can trip models up.

4. See it in code

The single most important transformation in the whole pipeline — logits → softmax → a sampled token — is a few lines of numpy. Everything before it (embeddings, attention, the Transformer stack) is "just" what produces the logits.

import numpy as np
 
# A language model's final layer emits one logit per vocabulary token:
# an unnormalized score for "is THIS the next token?". Toy 5-token vocab:
vocab  = ["the", "cat", "sat", "dog", "ran"]
logits = np.array([2.0, 1.0, 0.1, 1.5, -0.5])   # what the Transformer produced
 
def softmax(z, temperature=1.0):
    z = z / temperature          # temperature rescales the GAPS between logits
    z = z - z.max()              # subtract max: numerically safe, same result
    e = np.exp(z)                # exponentiate -> all positive, gaps amplified
    return e / e.sum()           # normalize -> a probability distribution (sums to 1)
 
probs = softmax(logits)
print(dict(zip(vocab, probs.round(3).tolist())))   # .tolist() gives plain Python floats
# {'the': 0.453, 'cat': 0.167, 'sat': 0.068, 'dog': 0.275, 'ran': 0.037}
 
# Greedy decoding: always the argmax -> deterministic, can be repetitive.
print("greedy :", vocab[int(np.argmax(probs))])           # 'the'
 
# Sampling: draw in proportion to probability -> varied, "creative".
rng = np.random.default_rng(0)
print("sampled:", vocab[rng.choice(len(vocab), p=probs)])
 
# Temperature < 1 sharpens (more confident); > 1 flattens (more random).
print("T=0.5  :", softmax(logits, 0.5).round(3))   # mass piles onto 'the'
print("T=2.0  :", softmax(logits, 2.0).round(3))   # closer to uniform

Line by line: logits stands in for the model's output. softmax divides by temperature (so a small T exaggerates the lead of the top token, a large T evens everyone out), subtracts the max for numerical stability, exponentiates, and normalizes. argmax is greedy decoding; rng.choice(..., p=probs) is sampling — the reason two identical prompts can give different answers. Run it and watch T=0.5 concentrate probability on "the" while T=2.0 spreads it out. That's the entire intuition behind the temperature slider in any chat API.

Softmax code, section by section

Setup: logits is a numpy array of 5 raw scores, one per vocabulary token. vocab is the list of words.

The softmax function:

  1. Divide by temperature: z = z / temperature. This rescales the logits: small T exaggerates gaps, large T flattens them. At T=1, nothing changes; at T=0.5, logits double (gaps widen); at T=2, they halve (flatten).
  2. Subtract the max: z = z - z.max(). Pure numerics: before exponentiating huge numbers, shift down so the biggest logit becomes 0. This prevents overflow and doesn't change the final probabilities (softmax is invariant to adding/subtracting a constant).
  3. Exponentiate: e = np.exp(z). Now all values are positive, and gaps are amplified (exp of 2.0 is much bigger than exp of 0.0). The exponentiation is what makes softmax nonlinear: bigger logits don't just get higher weight, they get exponentially higher weight.
  4. Normalize: e / e.sum(). Divide each exponentiated value by the sum so all probabilities add to 1. Now you have a valid probability distribution.

Using it:

  • probs = softmax(logits) produces the distribution.
  • Greedy: np.argmax(probs) picks the highest-probability token. It is deterministic and reliable, but can repeat.
  • Sampling: rng.choice(..., p=probs) rolls weighted dice using those probabilities, so different seeds give different answers (the fixed seed in np.random.default_rng(0) makes this run reproducible).
  • Temperature comparison (the last two print calls): T=0.5 concentrates probability heavily on the top token; T=2.0 spreads it closer to uniform.

The whole snippet teaches: from raw logits, divide by temperature, exponentiate, and normalize → probabilities. Then decode: greedy or sample. This is the final step of every LLM inference.
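
The subtract-the-max step is easy to skip past, so here is a small sketch of what it protects against. The function names and the deliberately huge logits are assumptions for the demonstration; real logits are rarely this large, but the shift costs almost nothing.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)                 # overflows to inf once z exceeds roughly 709
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - z.max())       # largest exponent is now exp(0) = 1
    return e / e.sum()

small = np.array([2.0, 1.0, 0.1])
print(np.allclose(softmax_naive(small), softmax_stable(small)))   # True: same answer

big = np.array([1000.0, 999.0, 998.0])
with np.errstate(over="ignore", invalid="ignore"):   # silence the overflow warnings
    print(softmax_naive(big))                         # [nan nan nan]
print(softmax_stable(big).round(3))                   # approx. 0.665, 0.245, 0.090
```

Both versions agree whenever the naive one works, because subtracting a constant from every logit cancels in the ratio.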

Illustration: attention, which words look at which.

Self-attention lets every token weigh every other token. In the illustration, the query token's arcs thicken with attention weight, showing how 'it' resolves to 'cat'. This is the operation transformers are built from.

5. Mental models & SWE analogies

  • Tokenizer ≈ a lexer. It deterministically chops a string into a stream of known atoms (tokens) before the "compiler" (the model) runs. BPE just learns its token set from data instead of a hand-written grammar.
  • Embedding ≈ a hash that preserves meaning. A normal hash scatters similar inputs randomly; an embedding does the opposite — similar meanings map to nearby vectors, which is exactly what makes vector search work.
  • Attention ≈ a content-addressable cache / a soft JOIN. Each token issues a query and gets back a weighted blend of every other token's value, keyed by relevance — a differentiable lookup over the whole sequence instead of an exact key match.
  • Logits ≈ the raw internal result, softmax ≈ serializing it into a well-formed response. Logits are the raw internal scores; softmax is the normalization step that makes them a well-formed distribution you can sample from.
  • Pretraining ≈ a general-purpose stdlib; post-training ≈ your app code. Pretraining gives broad capability; SFT/RLHF specialize it into a helpful assistant without rewriting the foundation.

6. Common confusions

  • "A token is a word." No — it's a subword. "unbelievable" can be 3 tokens; a space often rides along with the next token. Count tokens, not words, for cost and limits.
  • "Embeddings are looked up from a fixed dictionary." Static word vectors (word2vec/GloVe) were. Inside an LLM, the output vector for a token is recomputed in context by attention — "bank" differs by sentence.
  • "The model stores facts and retrieves them." It stores weights, not a database. It generates the statistically likely next token, which is why it can be fluently, confidently wrong (hallucinate). Grounding it is what RAG is for.
  • "Higher temperature = smarter." Temperature only controls randomness, not capability. For parsing, extraction, or classification you usually want T=0.
  • "Softmax outputs are calibrated confidence." They're a distribution, not a guarantee of correctness — a model can put 0.95 on a wrong token.
  • "RLHF teaches the model new facts." Post-training mostly shapes behavior and format (be helpful, follow instructions); the knowledge came from pretraining.

7. Check yourself

[Prereq] What is a token, and why subword instead of words or characters? A token is a chunk of text mapped to an integer id from a fixed vocabulary. Word-level vocabularies are unbounded and choke on unseen words; character-level makes sequences too long and units too meaningless. Subword/BPE merges frequent character pairs, so common chunks are one token and rare words fall back to smaller pieces — finite vocab, no out-of-vocabulary blanks.
[Prereq] What does an embedding give you that a token id doesn't? A token id is an arbitrary label with no notion of similarity. An embedding is a learned vector where distance/direction encode meaning, so "cat" and "dog" sit close (high cosine similarity) and analogies like king − man + woman ≈ queen hold. That geometry is what powers semantic search and RAG.
[IC3] Walk from the model's final layer to a sampled word. The final layer emits logits — one unnormalized score per vocab token. Softmax turns them into a probability distribution summing to 1. Temperature rescales logits first (low = sharper/safer, high = flatter/wilder). You then decode: greedy (argmax), or sample — optionally restricted by top-p, which keeps only the smallest set of tokens whose probabilities sum to p. Append the chosen token and repeat (autoregressive generation).
[IC4] Why did attention/Transformers replace RNNs? RNNs squeeze all history into one fixed hidden state (forgetting long-range context), must run strictly left-to-right (no parallelism), and suffer vanishing gradients over distance. Attention lets every token directly attend to every other via query–key–value, removing the bottleneck and the sequential dependency, so it parallelizes on GPUs and captures long-range relations. The Transformer stacks attention + feed-forward blocks and scales.
[IC3] What's the difference between pretraining and post-training? Pretraining is self-supervised next-token prediction over massive text, yielding a capable but unaligned base model. Post-training (SFT on instruction pairs, then RLHF or Direct Preference Optimization (DPO) on human preferences) shapes that model into an instruction-following, helpful, safe assistant: same machinery, aimed at behavior.

You're ready to move on when you can explain, end to end, how the sentence "the cat sat" becomes token ids, then context-aware vectors, then logits, then a sampled next word — and say in one breath why attention beat RNNs.

8. Go deeper

Next: you now have the vocabulary — go open the Transformers pillar to see how attention is actually built, or jump to Building AI Agents to put a chat model to work. For the retrieval side of embeddings, head to RAG → chunking & embeddings.
