RAG is a retrieval bet, not a generation bet — learn the full ingest→retrieve→rerank→generate pipeline, why naive RAG dies in production, and when to reach for RAG vs fine-tuning vs a 1M-token window.
Retrieval-Augmented Generation is the discipline of putting the right tokens in front of the model at inference time instead of baking knowledge into weights. The model is a fixed reasoning engine; RAG is the supply chain that feeds it fresh, large, private, and attributable facts on demand. The mental model that separates juniors from staff: RAG is a retrieval problem wearing a generation costume. Retrieval is almost always the bottleneck: if the right chunk never reaches the prompt, no amount of model intelligence recovers it. Everything in the RAG pillar is a refinement of one question: given a query, how do I reliably surface the small set of passages that actually answer it? This lesson sets up the whole pillar: the pipeline, the naive→advanced→agentic progression, and the RAG-vs-fine-tune-vs-long-context decision.
RAG is the single most common system-design prompt for applied AI roles because it touches retrieval, embeddings, latency budgets, evals, and the model itself. Interviewers use it to calibrate altitude: can you reason about the whole system, or only the part you've touched?
At IC4, the tells are:
At IC5, the tells are:
At IC6/staff, the tells are:
The core idea. Retrieval-Augmented Generation (RAG) is the practice of retrieving relevant documents or passages from a corpus and putting those passages into the prompt before asking the model a question. Instead of baking all knowledge into the model's weights during training, you keep knowledge outside the model and fetch it at query time.
The why.
Pre-training and fine-tuning write knowledge into weights. That is a terrible substrate for four kinds of knowledge:
The original framing comes from Lewis et al., 2020, who paired a parametric model (the weights) with a non-parametric memory (a retrievable index) and showed it beat closed-book models on knowledge-intensive tasks. The deeper reason RAG reduces hallucination: you change the model's job from recall ("what do I remember about X?") to reading comprehension ("here are five passages, answer from them"). Reading comprehension is a far more reliable behavior than parametric recall, and it gives you a citation for free. See /rag/evaluation for how we measure that as faithfulness — the fraction of claims in the answer supported by the retrieved context.
There are two halves: an offline ingest path and an online query path.
Ingest (offline, runs when documents change):
ingest → chunk → embed → index
Query (online, runs per request):
query → (rewrite) → retrieve (dense ± sparse) → fuse → rerank → assemble context → generate → (cite)
What it does: Reciprocal Rank Fusion (RRF) combines two ranked lists (one from dense search, one from sparse BM25) into a single list that draws on both.
The formula: For each chunk, sum up 1 / (k + rank) where rank is its position in each list, and k is a constant (typically 60). Higher score wins.
Concrete example: Say dense search ranked a chunk at position 1, and BM25 ranked it at position 5. With k = 60:
1 / (60 + 1) = 1/61 ≈ 0.0164
1 / (60 + 5) = 1/65 ≈ 0.0154
0.0164 + 0.0154 = 0.0318
Now say another chunk was dense rank 10 but BM25 rank 2:
1 / (60 + 10) = 1/70 ≈ 0.0143
1 / (60 + 2) = 1/62 ≈ 0.0161
0.0143 + 0.0161 = 0.0304
The first chunk (0.0318 > 0.0304) ranks higher in the fused list. What just happened: RRF penalized the second chunk for being buried in dense search despite ranking high in sparse. It rewards chunks that appear near the top in both lists.
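The fusion rule above is small enough to sketch directly. This is a minimal illustration, not a production fuser: the function name `rrf_fuse` and the chunk ids (`A`, `B`, and the `d*` / `s*` fillers) are made up for this example, and it assumes each input list is already sorted best-first.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several best-first ranked lists of ids with Reciprocal Rank Fusion.

    Each id scores the sum of 1 / (k + rank) over the lists it appears in
    (rank is 1-based). Ids missing from a list contribute nothing for it.
    Returns (id, score) pairs sorted by score, highest first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ranked lists reproducing the worked example:
# chunk A is dense rank 1 / BM25 rank 5, chunk B is dense rank 10 / BM25 rank 2.
dense = ["A", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "B"]
sparse = ["s1", "B", "s3", "s4", "A"]

fused = rrf_fuse([dense, sparse])
for doc_id, score in fused[:3]:
    print(doc_id, round(score, 4))
```

Because only ranks enter the score, the dense similarity and the BM25 score never need to be put on a common scale, which is the main practical reason to fuse by rank rather than by raw score.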
At each stage, the candidate set gets smaller and more precise.
Cost vs. quality: Early stages are cheap (vector DB + BM25 lookups <10ms), reranking is expensive (~100–400ms), and generation dominates overall latency. The trick: spend early-stage compute to narrow aggressively so generation runs on a tightly focused context.
Query: “How much did contextual retrieval cut failed retrievals?”
→ The user's question — note it asks 'how much', a number.
Step through the interactive pipeline above — watch a single query flow from rewrite to retrieval to fusion to rerank to generation, and notice how the candidate set shrinks and reorders at each stage. The thing to internalize: each stage trades cost/latency for recall or precision, and you tune the pipeline as a budget, not a checklist.
This is the spine of the whole pillar, and a favorite interview arc.
Naive RAG = chunk → embed → cosine → stuff the top-k into the prompt. It is the right thing to build first to get a baseline, and it is dead on arrival for most production workloads. Why it fails:
ERR_2043), SKUs, function names, person names, acronyms. Ask for get_user_by_id and a dense retriever may hand you fetch_account_details because they're semantically close. BM25 (a sparse, term-frequency ranker) nails exact and rare tokens but misses paraphrase. Neither alone is enough; hybrid wins (see /rag/hybrid-search-and-rerank).
Dense (bi-encoder embeddings):
"how do I get the current user ID?" → dense vector ≈ [0.2, 0.8, -0.1, ...].
"Call fetch_account_details(acct_id) to retrieve the user object." → dense vector ≈ [0.19, 0.82, -0.09, ...] (semantically similar!).
get_user_by_id.
Sparse (BM25):
["how", "do", "i", "get", "current", "user", "id"].
["call", "fetch_account_details", "acct_id", "retrieve", "user", "object"].
["user"].
get and id don't appear.
"Use get_user_by_id(user_token) to fetch the active user." has ["get", "user", "id"] → BM25 ranks this HIGH.
Hybrid (Dense + BM25 fused):
Advanced RAG fixes these with three families of technique: hybrid retrieval (dense + sparse fused with RRF), a reranking stage (cross-encoder over a top-50/100 candidate set), and query transforms (rewriting, decomposition, HyDE, multi-query). Anthropic's Contextual Retrieval (Sept 2024) is a clean exemplar: use an LLM to prepend a short, chunk-specific blurb situating each chunk in its parent document before embedding and BM25 indexing. Anthropic reported this cut failed retrievals by ~49% (contextual embeddings + contextual BM25), and ~67% once reranking was added. Prompt caching makes the per-chunk LLM annotation cheap because the full document sits in the cached prefix.
Agentic RAG removes the assumption that you retrieve exactly once. An agent decides whether to retrieve at all, what to retrieve, reformulates when the first pass is thin, routes among multiple sources, retrieves iteratively, and self-checks groundedness before answering. This ties directly into the /agents pillar and tool use / MCP: retrieval becomes a tool the model calls, not a fixed preprocessing step. It's more capable and more expensive — multiple model round-trips per answer — so you reserve it for multi-hop and open-ended questions. The progression is a cost/quality ladder; you climb it only as far as your eval and your latency budget demand.
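One way to picture the "retrieve more than once" part of that description is a bounded loop. The sketch below is an assumption-heavy illustration: `agentic_answer` and its four callables (`retrieve`, `is_grounded`, `reformulate`, `generate`) are hypothetical names, and in a real agent those decisions would be made by the model through tool calls rather than by plain Python functions.

```python
from typing import Callable, List


def agentic_answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    is_grounded: Callable[[str, List[str]], bool],
    reformulate: Callable[[str, List[str]], str],
    generate: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Retrieve iteratively until the context looks sufficient or the budget runs out."""
    query = question
    context: List[str] = []
    for _ in range(max_rounds):
        # Accumulate new passages, skipping ones already collected.
        for passage in retrieve(query):
            if passage not in context:
                context.append(passage)
        # Self-check: stop as soon as the context can support an answer.
        if is_grounded(question, context):
            break
        # First pass was thin: rewrite the query and try again.
        query = reformulate(question, context)
    return generate(question, context)
```

The `max_rounds` cap is the cost control: each extra round is another retrieval plus at least one more model call, which is why this pattern is reserved for multi-hop and open-ended questions.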
Internalize this and you'll debug RAG correctly for the rest of your career: retrieval is the failure, generation is the symptom. When a RAG system "hallucinates," nine times out of ten the right chunk never made it into the context, so the model fell back on parametric recall. The fix is upstream — better chunking, hybrid search, reranking, query rewriting — not a sterner prompt. This is why /rag/evaluation insists on measuring retrieval and generation separately: recall@k / nDCG / MRR for the retriever, faithfulness / answer relevancy for the generator. A system with 60% context recall cannot exceed 60% answerable questions no matter how good the model is.
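To make the retriever-side metrics concrete, here is a minimal sketch of recall@k and MRR over a labeled set. The function names and the tiny id lists are illustrative assumptions; a real eval would load a golden set of queries with their known-relevant chunk ids.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id (1-based), or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Hypothetical results for two queries.
runs = [
    (["c7", "c2", "c9"], {"c2", "c4"}),  # one of two relevant chunks found, at rank 2
    (["c1", "c3", "c5"], {"c1"}),        # the only relevant chunk found at rank 1
]
print([recall_at_k(ret, rel, k=3) for ret, rel in runs])
print(mean_reciprocal_rank(runs))
```

Tracking these separately from the generator metrics is what lets you say whether a bad answer came from a missing chunk or from a model that ignored a chunk it was given.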
These are not competitors; they are complementary tools that compose. The staff-level answer always starts by refusing the false dichotomy.
claude-opus-4-8 and claude-sonnet-4-6; see /context-engineering) is for whole-document reasoning: when the relevant material fits and cross-references span the whole doc. But it carries real costs: you pay for every token every call, latency scales with input, and "lost in the middle" still bites. A 1M-token window is not a license to skip retrieval; it's a license to retrieve coarser (fewer, larger chunks) and let the model do more in-context reasoning.
The composition in practice: RAG narrows millions of documents to the few thousand tokens that matter, long-context lets each retrieved unit be a large parent section instead of a tiny snippet, and a light fine-tune locks in citation format and refusal behavior. They stack.
Documents are chunked, embedded and indexed offline; at query time the system retrieves, fuses, reranks, and grounds the answer. Watch the data flow.
A tiny but real end-to-end RAG: BM25 sparse retrieval over a small corpus, then a Claude generation step constrained to the retrieved context. This is deliberately sparse-only and in-memory so it runs in seconds — the full version (dense + hybrid + rerank + eval) lives at examples/rag/mini_rag.py.

    """Tiny end-to-end RAG: BM25 retrieval + Claude generation, grounded and cited.

    pip install rank_bm25 anthropic  # ANTHROPIC_API_KEY in env
    Full hybrid+rerank+eval version: examples/rag/mini_rag.py
    """
    import anthropic
    from rank_bm25 import BM25Okapi

    # In production this CORPUS is the output of an ingest+chunk pipeline,
    # each item carrying source/title/ACL metadata. Here: five flat chunks.
    CORPUS = [
        "Anthropic's Contextual Retrieval prepends chunk-specific context before "
        "embedding and BM25 indexing, cutting failed retrievals by ~49%.",
        "Reciprocal Rank Fusion merges ranked lists: score = sum of 1/(k+rank), k about 60.",
        "BM25 is a sparse lexical ranker, strong on exact terms, codes, and rare tokens.",
        "Dense bi-encoder embeddings capture paraphrase but miss exact-token matches.",
        "Cross-encoder rerankers score query and document jointly, lifting relevance 10-30%.",
    ]

    # Index. BM25 wants tokenized docs; lowercase + whitespace split is the honest minimum.
    tokenized = [doc.lower().split() for doc in CORPUS]
    bm25 = BM25Okapi(tokenized)


    def retrieve(query: str, k: int = 3) -> list[str]:
        scores = bm25.get_scores(query.lower().split())
        ranked = sorted(zip(scores, CORPUS), key=lambda x: x[0], reverse=True)
        return [doc for score, doc in ranked[:k] if score > 0]  # drop zero-score noise


    client = anthropic.Anthropic()


    def answer(query: str) -> str:
        chunks = retrieve(query)
        if not chunks:
            return "No relevant context found."  # retrieval failed: don't guess
        context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
        prompt = (
            "Answer ONLY from the numbered context below. "
            "If the answer is not present, say so explicitly. "
            "Cite the chunk number(s) you used.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}"
        )
        resp = client.messages.create(
            model="claude-sonnet-4-6",  # high-volume RAG answer gen; opus-4-8 for harder synthesis
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return next(b.text for b in resp.content if b.type == "text")


    if __name__ == "__main__":
        print(answer("What k does RRF use, and why does hybrid beat dense alone?"))

What's worth noticing, because it's load-bearing in real systems:
Setup and corpus: Import the Anthropic client and the BM25 ranker; define a tiny in-memory corpus of 5 fact chunks. In production, this comes from an ingest+chunk pipeline and lives in a vector DB; here it's flat strings for speed.
Indexing: Tokenize each document (lowercase + whitespace split) and feed it to BM25Okapi, which computes the term and document frequency statistics it scores with. This is the one-time offline step; it doesn't run per query.
Retrieval function: retrieve() takes the user query, tokenizes it the same way, asks BM25 for scores, sorts by score descending, and returns the top-k chunks with non-zero scores. The if score > 0 guard drops chunks that share no terms with the query.
Grounding + citation: answer() calls retrieve(), formats chunks with numbers [1], [2], ..., and sends a prompt that explicitly tells the model to answer only from that context. The instruction "cite the chunk number(s) you used" makes the model show its work. The next(b.text ...) expression returns the first text block of the response.
The whole flow: User asks → retrieve top-3 chunks → stamp them [1][2][3] → send "answer ONLY from these" + question → model grounds answer in chunks and cites them. What it achieves: a tiny but real end-to-end RAG that reduces hallucination by instructing the model to answer from the retrieved chunks rather than from memorized facts, and to say so when the answer is absent.
The if not chunks guard is your first defense against the model falling back on parametric recall when retrieval whiffs.
claude-sonnet-4-6 is the workhorse for high-volume RAG answer generation; escalate to claude-opus-4-8 for multi-hop synthesis, or drop to claude-haiku-4-5 for simple extractive lookups. The retrieval layer doesn't change; only the generator does.
sentence-transformers / BGE-M3) and fusing with RRF, which is exactly what /rag/hybrid-search-and-rerank covers.

| Stage | Cost | Latency | Quality lever | Primary failure mode |
|---|---|---|---|---|
| Chunking | One-time compute | Offline | Boundary strategy (recursive, semantic, parent-child) | Orphaned context; severed tables/code |
| Embedding | $/1M tokens at index time | Offline | Model (text-embedding-3-large, Cohere v4, BGE-M3); dims via Matryoshka | Domain mismatch; stale index after model swap |
| Vector index | RAM/disk for vectors | <10ms (HNSW) | HNSW vs IVF-PQ; recall/latency/memory knobs | Low recall from aggressive ANN params |
| Sparse (BM25) | Cheap | <10ms | Tokenization, field boosting | Misses paraphrase entirely |
| Fusion (RRF) | Negligible | ~0 | k constant (≈60) | None significant; robust by design |
| Rerank | Per-query model call | +100–400ms | Cross-encoder over top-50/100 | Cost/latency at high QPS |
| Generation | $/1M in+out tokens | +500ms–several s | Model tier; context size | Lost-in-the-middle; ungrounded answers |
What changes at scale. At small scale, naive RAG in pgvector is fine and cheap. As the corpus and QPS grow, three things bite. First, index economics: at tens of millions of vectors, full-precision HNSW gets expensive in RAM, so you move to int8/binary quantization (Cohere embed v4 and BGE-M3 support this natively) and/or IVF-PQ, trading a few points of recall for an order of magnitude less memory — and you pick a vector DB built for it (Qdrant, Milvus, Turbopuffer, LanceDB). Second, reranking cost: a cross-encoder over 100 candidates per query is fine at 1 QPS and ruinous at 1000 QPS, so you either cache, narrow the candidate set, or move to a late-interaction model (ColBERT-style, near cross-encoder quality at much lower latency). Third, freshness and access control: ingest pipelines need incremental reindexing and per-document ACLs enforced as metadata filters at retrieval time, not as a post-hoc check.
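The index-economics point is easy to check with back-of-the-envelope arithmetic. The sketch below only counts raw vector storage; it ignores HNSW graph links, metadata, and replication, and the corpus size and dimensionality are hypothetical numbers picked for illustration.

```python
def vector_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw storage for the vectors alone (no graph, metadata, or replicas)."""
    return num_vectors * dims * bits_per_dim // 8


# Hypothetical corpus: 50M chunks embedded at 1024 dimensions.
NUM_VECTORS = 50_000_000
DIMS = 1024

for label, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    gb = vector_bytes(NUM_VECTORS, DIMS, bits) / 1e9
    print(f"{label:>8}: {gb:,.1f} GB")
```

Under these assumptions int8 is a 4x reduction and binary is 32x, so the "order of magnitude" saving comes from binary quantization or product quantization rather than int8 alone.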
The dominant failure modes, ranked: (1) retrieval recall — the right chunk isn't in the index or isn't retrieved; (2) chunking — the right info exists but was split badly; (3) ranking — the right chunk is retrieved but buried below junk the reranker should have caught; (4) generation — only after the first three are good does the model itself become the limiting factor. Debug in that order.
ingest → chunk → embed → index, online query → rewrite → retrieve(dense+sparse) → fuse → rerank → assemble → generate → cite. Then say the quiet part: it breaks at retrieval far more than at generation. Name the top culprits: dense-only retrieval missing exact tokens (fix with hybrid + BM25), bad chunk boundaries, and no reranking. Note that you'd build a golden eval set first so you can localize regressions. Naming real tools (pgvector/Qdrant, RRF, a cross-encoder reranker, Claude for generation) signals you've actually shipped one.
context recall. If recall is low, the bug is upstream (chunking, dense-only retrieval, query phrasing), and no prompt change fixes it. If recall is fine but the answer still drifts, measure faithfulness (how many of the answer's claims the context actually supports) and tighten the grounding prompt, add citation enforcement, or trim context to avoid lost-in-the-middle. The headline: most "hallucination" is retrieval failure in disguise, so I fix retrieval first. (See /rag/evaluation.)
Flashcard. RAG is a retrieval bet, not a generation bet. The pipeline is ingest→chunk→embed→index offline and query→rewrite→retrieve(dense+sparse)→fuse→rerank→assemble→generate→cite online. Naive RAG (cosine→stuff) is a baseline, not a product; advanced RAG adds hybrid + rerank + query transforms; agentic RAG lets the model decide whether/what/when to retrieve. Debug retrieval before generation. RAG (knowledge), fine-tuning (behavior), and long-context (whole-doc reasoning) compose; they don't compete.
Next: Chunking & Embeddings → — the offline half of the pipeline, where most retrieval quality is won or lost.