An LLM is autocomplete trained on the internet — text becomes numbered tokens, meaning becomes geometry, attention lets every token see every other, and the whole thing learns by predicting the next token. This is the vocabulary every agents/RAG conversation assumes.
A large language model (LLM) is autocomplete trained on a huge slice of the internet. Under the hood it does exactly one thing: read a sequence of tokens (numbered chunks of text) and predict the next one — then append it and repeat. Translation, code, chat, "reasoning" are all that single trick, scaled up. The SWE mental model: it's your IDE's autocomplete, except the "index" isn't a symbol table — it's a few hundred billion floating-point numbers tuned by gradient descent so that the most probable next token is usually a good one.
This lesson is the bridge from "neural nets" to everything in the Agents and RAG pillars. Four concrete payoffs:
Interviews silently assume you can answer: what is a token, what is an embedding, what did attention buy us over RNNs, and what do softmax/temperature do? Miss these and you sound like you've only used the API, never understood it.
The words first.
- Temperature: low (0.2) sharpens the gap so the top token dominates (predictable); high (1.5) flattens it (random, creative).
- Top-p: keep the most likely tokens until their probabilities add up to p (say 0.9), throw away the unlikely tail, then sample from just those.

Step by step.
Remember this: the model never "knows" the next word — it scores every possible token, converts those scores to probabilities, and rolls weighted dice you can tune.
Neural nets eat vectors of numbers, not strings. So the first job is to chop text into pieces and map each piece to an integer. A token is one such piece; the fixed list of all possible pieces is the vocabulary; a token's index in that list is its token id. Tokenization is the lookup "hello" → 15339.
Why not just split on words? The vocabulary would be effectively infinite (every typo, name, snake_case identifier, and emoji is a new word), and any word you didn't see in training becomes an unknown blank — useless for code and proper nouns.
Why not characters? The vocabulary is tiny and there are no unknowns, but sequences get ~5× longer (every letter is a step), and a lone t carries almost no meaning, so the model wastes capacity relearning that t-h-e spells "the."
The winning compromise is subword tokenization, usually via Byte-Pair Encoding (BPE). Idea: start from individual bytes/characters, then repeatedly find the most frequent adjacent pair and merge it into a new token. Common chunks ("the", "ing", "tion") become single tokens; rare words gracefully fall back to smaller pieces.
Worked micro-example. Train on the words low, lower, newest, widest. Start as characters:
l o w · l o w e r · n e w e s t · w i d e s t

The pair e s appears in both newest and widest, which ties it for most frequent in this tiny corpus, so suppose we merge it first → es. Now es t is among the most frequent pairs → merge → est. After a few more merges (l o → lo, then lo w → low), est and low are single reusable tokens, so a brand-new word like slowest tokenizes as s + low + est instead of seven characters, with no "unknown" needed.
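The merge loop can be sketched in a few lines of Python. This is a minimal character-level sketch, not a production tokenizer: the function names (`count_pairs`, `merge_pair`, `train_bpe`, `encode`) are illustrative, and ties are broken by "first pair seen wins", which is an assumption. Real BPE implementations work on bytes, weight words by frequency, and define their own tie-breaking, so the merge order here may differ from the order in the prose even though the resulting tokens are the same.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs

def merge_pair(symbols, pair):
    """Fuse every occurrence of `pair` inside one word into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(corpus, num_merges):
    """Learn an ordered list of merges from a list of words."""
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # assumption: ties go to the first pair seen
        merges.append(best)
        words = [merge_pair(w, best) for w in words]
    return merges

def encode(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

merges = train_bpe(["low", "lower", "newest", "widest"], num_merges=4)
print(merges)                     # [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
print(encode("slowest", merges))  # ['s', 'low', 'est']
```

Replaying merges in training order is what makes encoding deterministic: a word never seen in training still breaks down into pieces the vocabulary already contains.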
Practical facts worth carrying:
- Long or rare words split into pieces: "tokenization" as token + ization; "Anthropic" as Anthrop + ic.
- Whitespace usually attaches to the following token.

A token id like 15339 is just a label; 15340 isn't "one more" than it. We want each token to become a vector where distance and direction encode meaning. That vector is an embedding: a list of, say, 768 learned numbers per token. The model keeps an embedding matrix of shape [vocab_size, d] and "looks up" each token's row, which makes it a learnable lookup table.
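As a rough illustration, the lookup is plain row indexing. The sketch below assumes a vocabulary of 20,000 tokens and d = 768, and fills the matrix with random numbers as a stand-in for learned weights; the second token id is made up for the example.

```python
import numpy as np

vocab_size, d = 20_000, 768  # illustrative sizes, not from any specific model
rng = np.random.default_rng(0)

# Stand-in for learned weights: in a real model these rows are trained, not random.
embedding_matrix = 0.02 * rng.standard_normal(size=(vocab_size, d), dtype=np.float32)

token_ids = np.array([15339, 1917])     # 15339 is "hello" in the text; 1917 is hypothetical
vectors = embedding_matrix[token_ids]   # the "lookup": pick one row per token id

print(vectors.shape)  # (2, 768)
```

During training, gradients flow back into exactly the rows that were looked up, which is how each token's vector gets tuned.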
The classic result (word2vec, 2013; GloVe from Stanford, 2014) is that if you train these vectors so words appearing in similar contexts get similar vectors, geometry starts to mirror semantics. Similarity is measured by cosine similarity, the cosine of the angle between two vectors:

$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
Here a · b is the dot product (multiply componentwise, sum), and ‖a‖ is the vector's length. The result runs from −1 (opposite) to 1 (identical direction); "cat" and "dog" land close, "cat" and "thermodynamics" far apart. Even directions carry meaning, giving the famous analogy:

$$\text{king} - \text{man} + \text{woman} \approx \text{queen}$$
The "royalty" and "gender" concepts became roughly consistent directions in the space. This is exactly the trick RAG reuses: embed a query and your documents, then retrieve by cosine similarity.
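A toy version of both ideas, cosine similarity and direction arithmetic, fits in a few lines. The 3-dimensional vectors below are hand-made for illustration (axes loosely meaning "royalty", "gender", "animal"); they are an assumption, not learned embeddings, and real embeddings only satisfy the analogy approximately.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors: dot product over the product of lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made toy vectors (hypothetical): [royalty, gender, animal]
emb = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "cat":   np.array([0.0,  0.0, 1.0]),
}

# Direction arithmetic: remove "male", add "female", keep "royalty".
target = emb["king"] - emb["man"] + emb["woman"]

# Retrieval by cosine similarity, excluding the words used to build the query.
candidates = [w for w in emb if w not in {"king", "man", "woman"}]
nearest = max(candidates, key=lambda w: cosine(target, emb[w]))
print(nearest)  # queen
```

The last two lines are the same shape as RAG retrieval: score every candidate against a query vector and keep the best match.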
The fatal limitation: word2vec/GloVe give one vector per word, forever. "bank" gets a single embedding that must average river bank and savings bank. There's no way to bend the vector based on the surrounding sentence. Real language is contextual — and fixing that is the rest of this story.
To use context you must process a sequence. The first deep approach was the Recurrent Neural Network (RNN): walk left to right, maintaining a hidden state h (a vector summarizing everything seen so far). At each token: h_new = f(h_old, embedding(token)). The LSTM (1997) is a fancier RNN with gates that decide what to keep or forget, which helped it remember longer.
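The recurrence h_new = f(h_old, embedding(token)) can be sketched as follows. This assumes the simplest choice of f, a vanilla RNN cell with a tanh nonlinearity; the weight names `W_x`, `W_h`, `b` and the tiny sizes are illustrative, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_hidden = 4, 3  # tiny illustrative sizes

# Random stand-ins for learned parameters.
W_x = rng.normal(scale=0.5, size=(d_hidden, d_embed))   # input -> hidden
W_h = rng.normal(scale=0.5, size=(d_hidden, d_hidden))  # hidden -> hidden
b = np.zeros(d_hidden)

def rnn_step(h_old, x):
    """One step of a vanilla RNN: mix the old state with the new token's embedding."""
    return np.tanh(W_h @ h_old + W_x @ x + b)

token_embeddings = rng.normal(size=(5, d_embed))  # a 5-token sequence
h = np.zeros(d_hidden)
for x in token_embeddings:   # strictly left to right: step t needs step t-1
    h = rnn_step(h, x)

print(h.shape)  # (3,)
```

Note the loop: each step depends on the previous one, and everything the model remembers has to fit in the single vector `h`.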
RNNs read context, but they have three painful bottlenecks:
The fix (Bahdanau 2015, then generalized) is attention: instead of cramming the past into one hidden state, let each token directly look at every other token and pull in what's relevant. For the word "it" in "the cat sat because it was tired," attention lets "it" reach back and weight "cat" heavily.
Mechanically, every token produces three vectors:
- q: "what am I looking for?"
- k: "what do I offer?"
- v: "what I'll actually contribute if you attend to me."

A token compares its query to every token's key (its own included) via dot product (high dot product = relevant), softmaxes those scores into weights that sum to 1, and takes the weighted average of the corresponding values. In one line:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V$$
Every symbol in plain words: Q is the stacked queries from all tokens (what each token is looking for). K is the keys (what each token offers as a match). V is the values (the actual data each token contributes if matched). The term QK^T is the all-pairs dot product — token i's query dotted with every token's key, producing one score per candidate. The √d rescales to keep those scores from getting too large (d is the vector size, like 768). Softmax then converts each row of scores into weights that sum to 1. Finally, multiply by V to take a weighted average of the values.
Concrete numeric walk: Suppose we have 3 tokens, d = 2 (tiny for learning), and:
Q = [[1.0, 0.5], [0.2, 1.5], [0.8, 0.1]] (query vectors for tokens 1, 2, 3)
K = [[1.0, 0.0], [0.5, 1.0], [2.0, 0.5]] (key vectors)
V = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]] (value vectors)

First, QK^T: token 1's query [1.0, 0.5] dots the first key [1.0, 0.0] → 1.0*1.0 + 0.5*0.0 = 1.0. Token 1 dots all three keys: [1.0, 1.0, 2.25]. Do this for all 3 queries (all 9 dot products):

QK^T = [[1.0, 1.0, 2.25],
        [0.2, 1.6, 1.15],
        [0.8, 0.5, 1.65]]

Divide by √2 ≈ 1.414: QK^T / √2 ≈ [[0.71, 0.71, 1.59], ...]. Softmax each row independently. Row 1: exponentiate to [e^0.71, e^0.71, e^1.59] ≈ [2.03, 2.03, 4.91], sum ≈ 8.96, normalize: [0.226, 0.226, 0.548]. So token 1 attends most heavily to token 3 (weight 0.548), less to 1 and 2.

Finally, token 1's output is 0.226 * V[0] + 0.226 * V[1] + 0.548 * V[2]:

0.226 * [0.1, 0.9] + 0.226 * [0.8, 0.2] + 0.548 * [0.5, 0.5]
≈ [0.023, 0.203] + [0.181, 0.045] + [0.274, 0.274]
≈ [0.48, 0.52]

That weighted average is token 1's "attended" output: a blend of all values, weighted by how relevant each token's key was to its query.
What just happened: each token looked at every other token, computed relevance scores, turned them into a probability distribution, and pulled a weighted average of values. This is how "it" reaches back to "cat" — the word "it" gets a high-weight match to "cat"'s key and pulls in "cat"'s features.
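The hand computation can be checked with a few lines of NumPy. This is a minimal sketch of single-head scaled dot-product attention using the Q, K, V from the numeric walk, with no masking and no learned projection matrices; the helper names `softmax_rows` and `attention` are illustrative, not a library API.

```python
import numpy as np

def softmax_rows(x):
    """Softmax applied independently to each row."""
    x = x - x.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # all-pairs relevance grid
    weights = softmax_rows(scores)   # each row sums to 1
    return weights @ V, weights      # weighted average of the values

Q = np.array([[1.0, 0.5], [0.2, 1.5], [0.8, 0.1]])
K = np.array([[1.0, 0.0], [0.5, 1.0], [2.0, 0.5]])
V = np.array([[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]])

out, weights = attention(Q, K, V)
print((Q @ K.T)[0])         # [1.   1.   2.25]
print(weights[0].round(3))  # [0.226 0.226 0.548]
print(out[0].round(3))      # [0.477 0.523]
```

All three tokens are processed by the same two matrix multiplies, which is the parallelism an RNN's step-by-step loop cannot offer.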
Q, K, V are the stacked query/key/value vectors for all tokens; QKᵀ is the all-pairs relevance grid; dividing by √d (d = vector size) keeps the numbers from exploding; softmax turns each row into weights; multiplying by V is the weighted average. The key win: there's no fixed bottleneck and no sequential dependency — every pair interacts directly, and it all runs as parallel matrix multiplies. Now "bank" can finally become context-dependent: its output vector bends toward "river" or "money" based on its neighbors.
The Transformer (Vaswani et al., 2017, "Attention Is All You Need") threw out recurrence entirely and built the network from stacked blocks of attention + a small feed-forward network, with position information added back in (since attention alone is order-blind). Because it's parallel and scales beautifully on GPUs, it's the architecture behind essentially every major LLM today. The internals (multi-head attention, positional encodings, residual streams, layer norm) are the entire Transformers pillar. For now, hold this: a Transformer is a deep stack of "every token attends to every token, then thinks for a moment."
Put it together. Feed in a sequence of tokens; the Transformer outputs, for the next position, one raw score per vocabulary token. Those raw scores are logits: unnormalized "how likely is this the next token?" numbers, ranging over all reals. To turn ~100k logits into a probability distribution, apply softmax:

$$p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$
Every symbol: z_i is the logit (raw score) for token i. e is exp (the exponential function). T is temperature. The numerator e^(z_i / T) exponentiates each logit. The denominator sums all exponentiated logits so they're normalized (sum to 1).
Concrete example: suppose the model produced logits [2.0, 1.0, 0.1, 1.5, -0.5] for vocab ["the", "cat", "sat", "dog", "ran"].
At T = 1 (normal softmax):
Exponentiate: [e^2.0, e^1.0, e^0.1, e^1.5, e^-0.5] ≈ [7.39, 2.72, 1.11, 4.48, 0.61].
Sum ≈ 16.3. Divide each by 16.3:
[7.39/16.3, 2.72/16.3, 1.11/16.3, 4.48/16.3, 0.61/16.3]
≈ [0.453, 0.167, 0.068, 0.275, 0.037]

So "the" wins with 45.3%, "dog" second with 27.5%.
At T = 0.5 (cold, sharpened):
Divide logits first: [2.0/0.5, 1.0/0.5, ...] = [4.0, 2.0, 0.2, 3.0, -1.0].
Exponentiate: [e^4.0, e^2.0, e^0.2, e^3.0, e^-1.0] ≈ [54.6, 7.39, 1.22, 20.1, 0.368].
Sum ≈ 83.7. Normalize: [0.653, 0.088, 0.015, 0.240, 0.004].
Now "the" dominates with 65%, and tiny logits are nearly impossible. Temperature sharpened the distribution.
At T = 2.0 (hot, flattened):
Divide logits: [1.0, 0.5, 0.05, 0.75, -0.25].
Exponentiate: [e^1.0, e^0.5, ...] ≈ [2.72, 1.65, 1.05, 2.12, 0.78].
Sum ≈ 8.31. Normalize: [0.327, 0.198, 0.126, 0.255, 0.094].
Now the probabilities are much more uniform: "sat" and "ran" (previously tiny) have real weight, because high temperature blurs the differences.
What just happened: the temperature rescaled the logits before exponentiation, which amplifies or blunts their differences before normalization. Low T makes the best choice ever more dominant (deterministic, safe). High T flattens all choices toward uniform (random, creative).
z_i is the logit for token i, e is exp (makes everything positive and amplifies gaps), the denominator normalizes so all p_i sum to 1, and T is temperature (T = 1 leaves the logits unchanged). You now have P(next token | everything so far). You pick a token, append it, and run again: that loop is autoregressive generation.
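The pick, append, repeat loop can be sketched with a toy stand-in for the model. Everything about `toy_model` and the `NEXT_LOGITS` table is hypothetical: it only looks at the last token, whereas a real Transformer conditions on the whole sequence. The loop around it is the part that carries over.

```python
import numpy as np

vocab = ["the", "cat", "sat", "dog", "ran"]

# Hypothetical stand-in for a Transformer: row i holds the logits for
# "what comes after vocab[i]". A real model would read the full context.
NEXT_LOGITS = np.array([
    [-1.0,  2.0, -1.0,  1.5, -1.0],  # after "the": cat, then dog
    [-1.0, -1.0,  2.0, -1.0,  1.0],  # after "cat": sat, then ran
    [ 2.0, -1.0, -1.0, -1.0, -1.0],  # after "sat": the
    [-1.0, -1.0,  1.0, -1.0,  2.0],  # after "dog": ran, then sat
    [ 2.0, -1.0, -1.0, -1.0, -1.0],  # after "ran": the
])

def toy_model(token_ids):
    return NEXT_LOGITS[token_ids[-1]]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(prompt_ids, steps, rng=None):
    """Autoregressive loop: predict, pick a token, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(steps):
        probs = softmax(toy_model(ids))
        if rng is None:
            next_id = int(np.argmax(probs))                 # greedy
        else:
            next_id = int(rng.choice(len(probs), p=probs))  # sampling
        ids.append(next_id)
    return ids

greedy = generate([0], steps=4)
print(" ".join(vocab[i] for i in greedy))  # the cat sat the cat

sampled = generate([0], steps=4, rng=np.random.default_rng(0))
print(" ".join(vocab[i] for i in sampled))  # varies with the seed
```

The greedy run already shows a loop forming ("the cat sat the cat"), which is the repetition problem sampling is meant to break.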
The genius is the training signal. The "label" for any position is simply the token that actually came next in the real text — which is free, sitting right there in the corpus. No human annotation. This is self-supervised learning: the data labels itself. Run it over trillions of tokens and minimizing next-token cross-entropy loss forces the model to absorb grammar, facts, code, and reasoning patterns, because predicting the next token well requires understanding the previous ones.
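A small sketch of where the "free labels" come from and what the loss measures. The helper names `next_token_pairs` and `cross_entropy` are illustrative, the token ids are toy indices into a 5-word vocabulary, and the logits are just an example score vector, not the output of a trained model.

```python
import numpy as np

def next_token_pairs(token_ids):
    """Self-supervision: the label for each position is simply the next token."""
    return list(zip(token_ids[:-1], token_ids[1:]))

def cross_entropy(logits, target_id):
    """Negative log-probability the model assigned to the token that actually came next."""
    z = logits - logits.max()                # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())  # log-softmax
    return float(-log_probs[target_id])

vocab = ["the", "cat", "sat", "dog", "ran"]
text_ids = [0, 1, 2]  # "the cat sat"
print(next_token_pairs(text_ids))  # [(0, 1), (1, 2)]

# Example logits for one position.
logits = np.array([2.0, 1.0, 0.1, 1.5, -0.5])
print(round(cross_entropy(logits, target_id=0), 3))  # 0.791  (true next token was likely)
print(round(cross_entropy(logits, target_id=4), 3))  # 3.291  (true next token was unlikely)
```

Training nudges the weights so that the loss averaged over all positions in the corpus goes down, which means putting more probability on whatever actually came next.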
Decoding — turning probabilities into text:
| Strategy | What it does | Effect |
|---|---|---|
| Greedy / argmax | always take the highest-probability token | deterministic, can loop or feel flat |
| Sampling | draw a token in proportion to its probability | varied, "creative," occasionally wrong |
| Temperature T | rescale logits before softmax | T<1 sharpens (safer), T>1 flattens (wilder), T→0 ≈ greedy |
| Top-p (nucleus) | sample only from the smallest set of tokens whose probs sum to p (e.g. 0.9) | cuts the long tail of nonsense while keeping variety |
Temperature and top-p are the two knobs you'll actually turn in production. High temperature for brainstorming; low (or T=0) for extraction, classification, and anything you'll parse.
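Top-p is the one strategy in the table that takes more than a line to implement, so here is a minimal sketch. The function name `top_p_filter` is illustrative, and "keep tokens until the cumulative probability first reaches p" is one common convention; production samplers differ in details such as how ties and the boundary token are handled.

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most-likely tokens whose probabilities reach p, then renormalize."""
    order = np.argsort(probs)[::-1]                       # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1      # first position where the running sum >= p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

vocab = ["the", "cat", "sat", "dog", "ran"]
probs = softmax([2.0, 1.0, 0.1, 1.5, -0.5])

nucleus = top_p_filter(probs, p=0.9)
print([w for w, q in zip(vocab, nucleus) if q > 0])  # ['the', 'cat', 'sat', 'dog']

rng = np.random.default_rng(0)
print(vocab[rng.choice(len(vocab), p=nucleus)])      # sampled from the nucleus only
```

With these logits the top three tokens add up to about 0.895, just short of 0.9, so "sat" is kept as well and only "ran" is cut.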
What we just described — self-supervised next-token prediction on internet-scale text — is pretraining. It produces a base model: a brilliant autocomplete that will happily continue your prompt but doesn't reliably follow instructions or stay helpful/harmless. Post-training fixes that. Supervised fine-tuning (SFT) continues training on curated (instruction, ideal response) pairs so the model learns the assistant format. Then RLHF (Reinforcement Learning from Human Feedback) trains a reward model on human preference comparisons ("response A is better than B") and nudges the LLM toward higher-rewarded outputs. Same next-token machinery, just pointed at behavior instead of raw text. That's the whole job of the Fine-tuning pillar.
Real tokenizers learn subword merges (BPE) so common words are one token and rare words split into pieces, which is why token counts ≠ word counts and why spelling/maths can trip models up.
The single most important transformation in the whole pipeline — logits → softmax → a sampled token — is a few lines of numpy. Everything before it (embeddings, attention, the Transformer stack) is "just" what produces the logits.
```python
import numpy as np

# A language model's final layer emits one logit per vocabulary token:
# an unnormalized score for "is THIS the next token?". Toy 5-token vocab:
vocab = ["the", "cat", "sat", "dog", "ran"]
logits = np.array([2.0, 1.0, 0.1, 1.5, -0.5])  # what the Transformer produced

def softmax(z, temperature=1.0):
    z = z / temperature   # temperature rescales the GAPS between logits
    z = z - z.max()       # subtract max: numerically safe, same result
    e = np.exp(z)         # exponentiate -> all positive, gaps amplified
    return e / e.sum()    # normalize -> a probability distribution (sums to 1)

probs = softmax(logits)
# Cast to plain floats so the dict prints the same on every NumPy version.
print({w: round(float(p), 3) for w, p in zip(vocab, probs)})
# {'the': 0.453, 'cat': 0.167, 'sat': 0.068, 'dog': 0.275, 'ran': 0.037}

# Greedy decoding: always the argmax -> deterministic, can be repetitive.
print("greedy :", vocab[int(np.argmax(probs))])  # 'the'

# Sampling: draw in proportion to probability -> varied, "creative".
rng = np.random.default_rng(0)
print("sampled:", vocab[rng.choice(len(vocab), p=probs)])

# Temperature < 1 sharpens (more confident); > 1 flattens (more random).
print("T=0.5 :", softmax(logits, 0.5).round(3))  # mass piles onto 'the'
print("T=2.0 :", softmax(logits, 2.0).round(3))  # closer to uniform
```

Line by line: logits stands in for the model's output. softmax divides by temperature (so a small T exaggerates the lead of the top token, a large T evens everyone out), subtracts the max for numerical stability, exponentiates, and normalizes. argmax is greedy decoding; rng.choice(..., p=probs) is sampling, the reason two identical prompts can give different answers. Run it and watch T=0.5 concentrate probability on "the" while T=2.0 spreads it out. That's the entire intuition behind the temperature slider in any chat API.
Setup: logits is a numpy array of 5 raw scores, one per vocabulary token. vocab is the list of words.
The softmax function:
- z = z / temperature. This rescales the logits: small T exaggerates gaps, large T flattens them. At T=1, nothing changes; at T=0.5, logits double (gaps widen); at T=2, they halve (flatten).
- z = z - z.max(). Pure numerics: before exponentiating huge numbers, shift down so the biggest logit becomes 0. This prevents overflow and doesn't change the final probabilities (softmax is invariant to adding/subtracting a constant).
- e = np.exp(z). Now all values are positive, and gaps are amplified (exp of 2.0 is much bigger than exp of 0.0). The exponentiation is what makes softmax nonlinear: bigger logits don't just get higher weight, they get exponentially higher weight.
- e / e.sum(). Divide each exponentiated value by the sum so all probabilities add to 1. Now you have a valid probability distribution.

Using it:

- probs = softmax(logits) produces the distribution.
- argmax(probs) picks the highest-probability token: deterministic, reliable, can repeat.
- rng.choice(..., p=probs) rolls weighted dice using those probabilities, so different runs give different answers (the seed passed to default_rng fixes the random state for reproducibility).
- T=0.5 concentrates probability heavily on the top token; T=2.0 spreads it closer to uniform.

The whole snippet teaches: from raw logits, divide by temperature, exponentiate, and normalize → probabilities. Then decode: greedy or sample. This is the final step of every LLM inference.
Self-attention lets every token weigh every other token. Here the query token's arcs thicken with attention weight — how 'it' resolves to 'cat'. This is the operation transformers are built from.
JOIN. Each token issues a query and gets back a weighted blend of every other token's value, keyed by relevance: a differentiable lookup over the whole sequence instead of an exact key match.

Response, softmax ≈ serialization to a probability JSON. Logits are the raw internal scores; softmax is the normalization step that makes them a well-formed distribution you can sample from.

T=0.

You're ready to move on when you can explain, end to end, how the sentence "the cat sat" becomes token ids, then context-aware vectors, then logits, then a sampled next word, and say in one breath why attention beat RNNs.
Next: you now have the vocabulary — go open the Transformers pillar to see how attention is actually built, or jump to Building AI Agents to put a chat model to work. For the retrieval side of embeddings, head to RAG → chunking & embeddings.