IC3 · IC4 · IC5

RAG Foundations

RAG is a retrieval bet, not a generation bet — learn the full ingest→retrieve→rerank→generate pipeline, why naive RAG dies in production, and when to reach for RAG vs fine-tuning vs a 1M-token window.

16 min read · 13 sections

Prerequisites: embeddings & cosine similarity, basic LLM prompting, what a vector index is

Runnable: ai-eng-wiki/examples/rag/mini_rag.py

1. Quick anchor

Retrieval-Augmented Generation is the discipline of putting the right tokens in front of the model at inference time instead of baking knowledge into weights. The model is a fixed reasoning engine; RAG is the supply chain that feeds it fresh, large, private, and attributable facts on demand. The mental model that separates juniors from staff: RAG is a retrieval problem wearing a generation costume. Retrieval is almost always the bottleneck — if the right chunk never reaches the prompt, no amount of model intelligence recovers it. Everything in the RAG pillar is a refinement of one question: given a query, how do I reliably surface the small set of passages that actually answer it? This lesson sets up the whole pillar — the pipeline, the naive→advanced→agentic progression, and the RAG-vs-fine-tune-vs-long-context decision.

2. Why interviewers probe this

RAG is the single most common system-design prompt for applied AI roles because it touches retrieval, embeddings, latency budgets, evals, and the model itself. Interviewers use it to calibrate altitude: can you reason about the whole system, or only the part you've touched?

At IC4, the tells are:

You can draw the full pipeline (ingest → chunk → embed → index → retrieve → rerank → generate) without prompting, and you name real tools at each stage.
You know naive cosine-over-chunks is a starting point, not a destination, and can say why it fails (exact-match misses, chunk boundaries, no reranking).
You instinctively separate "retrieval failed" from "generation failed" when debugging.

At IC5, the tells are:

You quantify tradeoffs: "reranking adds ~100–400ms and a per-query cost but typically lifts relevance 10–30%, so I gate it behind a top-50 candidate set."
You design the evaluation before the system, and you regression-gate retrieval and generation separately.
You reach for hybrid (dense + BM25) by reflex and can justify the fusion math.

At IC6/staff, the tells are:

You frame RAG as one option in a portfolio — RAG vs fine-tune vs long-context vs agentic retrieval — and pick based on knowledge volatility, access control, attribution requirements, and unit economics.
You can argue against RAG (when long-context or fine-tuning dominates) and know the failure modes of each.
You think about the org: who owns the golden eval set, how retrieval quality is monitored in prod, how access control rides along with retrieval.

3. Concept build-up

3.1 Why RAG exists at all

Beginner explainer: New here? RAG in 60 seconds

The core idea. Retrieval-Augmented Generation (RAG) is the practice of retrieving relevant documents or passages from a corpus and putting those passages into the prompt before asking the model a question. Instead of baking all knowledge into the model's weights during training, you keep knowledge outside the model and fetch it at query time.

The why.

Parametric knowledge (baked into weights) is stale, expensive to update, and impossible to cite.
Non-parametric knowledge (in a retrievable index) is fresh, huge, private, and comes with a source.

The key mental model: RAG is a retrieval problem, not a generation problem. If the right passage never reaches the prompt, no prompt magic recovers it.

Step by step.

Parse documents and split them into passages (chunks).
Turn each chunk into a vector (an embedding), a list of numbers capturing meaning.
Store these vectors in a searchable database.
When a user asks a question, search for the chunks most similar to that question.
Take the top few matches and stuff them into the prompt.
Ask the model to answer using only those passages.
Cite the passages the model used.

Remember this: The bottleneck is almost always retrieval, not generation. If you're blaming the model for "hallucinating," first check whether the right passage was even retrieved. Nine times out of ten it wasn't.

Pre-training and fine-tuning write knowledge into weights. That is a terrible substrate for four kinds of knowledge:

Fresh — anything that changed after the training cutoff. Weights are a snapshot; the world is not.
Large — corpora far bigger than any context window or training budget (millions of documents, your entire Confluence, every support ticket).
Private / access-controlled — tenant data you cannot and must not train a shared model on, and where who is allowed to see which document must be enforced at query time.
Attributable — answers that must cite a source, because the user (or a regulator, or a lawyer) needs to verify the claim against ground truth.

The original framing comes from Lewis et al., 2020, who paired a parametric model (the weights) with a non-parametric memory (a retrievable index) and showed it beat closed-book models on knowledge-intensive tasks. The deeper reason RAG reduces hallucination: you change the model's job from recall ("what do I remember about X?") to reading comprehension ("here are five passages, answer from them"). Reading comprehension is a far more reliable behavior than parametric recall, and it gives you a citation for free. See /rag/evaluation for how we measure that as faithfulness — the fraction of claims in the answer supported by the retrieved context.

3.2 The full pipeline, end to end

There are two halves: an offline ingest path and an online query path.

Ingest (offline, runs when documents change):

ingest → chunk → embed → index

Ingest: parse PDFs, HTML, Markdown, code, transcripts into clean text + metadata (source, title, timestamp, ACL/tenant, section headings).
Chunk: split into retrievable units. This is where most quality is won or lost — covered in depth in /rag/chunking-and-embeddings.
Embed: turn each chunk into a dense vector with a bi-encoder.
Index: store vectors (and the original text + metadata) in a vector DB for approximate nearest-neighbor (ANN) search.
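As a rough illustration of the chunk step and the metadata that rides along with it, here is a minimal sketch of fixed-size chunking with overlap. The `Chunk` dataclass, the `chunk_words` helper, and the window sizes are assumptions made for this example; they are not taken from the runnable example, and real pipelines usually split on structure (headings, sentences) rather than raw word counts.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str    # provenance, carried through to citation
    tenant: str    # ACL/tenant tag, filtered on at retrieval time
    position: int  # index of the chunk within its document


def chunk_words(text: str, source: str, tenant: str,
                size: int = 50, overlap: int = 10) -> list[Chunk]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks = []
    for i, start in enumerate(range(0, len(words), step)):
        window = words[start:start + size]
        chunks.append(Chunk(" ".join(window), source, tenant, i))
        if start + size >= len(words):
            break  # this window already reached the end of the document
    return chunks


# Toy document of 120 placeholder words: w0 w1 ... w119
doc = " ".join(f"w{n}" for n in range(120))
chunks = chunk_words(doc, source="handbook.md", tenant="acme")
```

The overlap is the cheap defence against a sentence being cut off from the context that explains it: the tail of one chunk is repeated at the head of the next.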

Query (online, runs per request):

query → (rewrite) → retrieve (dense ± sparse) → fuse → rerank → assemble context → generate → (cite)

(Rewrite): optionally transform the user query — expand abbreviations, decompose multi-part questions, or generate a hypothetical answer to embed (HyDE). See /rag/advanced-retrieval.
Retrieve: pull top-k candidates by dense vector similarity and/or sparse lexical match (BM25).
Fuse: merge the dense and sparse ranked lists (Reciprocal Rank Fusion).

✎ Reciprocal Rank Fusion (RRF) — on real numbers

What it does: RRF combines two ranked lists (one from dense search, one from sparse BM25) into a single list that leverages both.

The formula: for each chunk, sum 1 / (k + rank) over every list it appears in, where rank is its 1-based position in that list and k is a constant (typically 60). Higher score wins.

Concrete example: say dense search ranked a chunk at position 1, and BM25 ranked it at position 5. With k = 60:

Dense contribution: 1 / (60 + 1) = 1/61 ≈ 0.0164
BM25 contribution: 1 / (60 + 5) = 1/65 ≈ 0.0154
Total RRF score: 0.0164 + 0.0154 = 0.0318

Now say another chunk was dense rank 10 but BM25 rank 2:

Dense contribution: 1 / (60 + 10) = 1/70 ≈ 0.0143
BM25 contribution: 1 / (60 + 2) = 1/62 ≈ 0.0161
Total RRF score: 0.0143 + 0.0161 = 0.0304

The first chunk (0.0318 > 0.0304) ranks higher in the fused list. What just happened: the second chunk's strong BM25 rank could not fully offset being buried at rank 10 in dense search. RRF rewards chunks that appear near the top of both lists.
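The same arithmetic can be sketched in a few lines of Python. This is a minimal version that assumes each retriever returns an ordered list of chunk ids; the function name `rrf` and the filler ids (`d2`, `s1`, and so on) are made up for the example.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk ids; rank is 1-based within each list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Chunk "A": dense rank 1, BM25 rank 5. Chunk "B": dense rank 10, BM25 rank 2.
dense = ["A"] + [f"d{i}" for i in range(2, 10)] + ["B"]
sparse = ["s1", "B", "s3", "s4", "A"]
fused = rrf([dense, sparse])
# "A" and "B" lead the fused list, in that order, because they are the only
# ids that collect a contribution from both rankings.
```

Because RRF only looks at ranks, it needs no score normalization between the dense and sparse retrievers, which is a large part of why it is a common default.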

Rerank: re-score the merged candidate set with a cross-encoder and keep the top 3–8.
Assemble: format the surviving chunks into the prompt with provenance markers.
Generate: the model answers, ideally constrained to the supplied context.
(Cite): attach source references to the answer.

✎ The full RAG pipeline — how context shrinks and reorders

At each stage, the candidate set gets smaller and more precise.

Retrieve (Dense): 1000 candidates ranked by cosine similarity to query vector.
Retrieve (Sparse): top 1000 documents by BM25 (lexical term match).
Fuse (RRF): merge the two lists, score every candidate by its rank in both, and keep roughly the top 500.
Rerank (Cross-encoder): score top 50 with a cross-encoder that sees query ↔ document interactions → top 8.
Assemble: format those 8 passages into the prompt with source markers.
Generate: model answers from only those 8 passages.
Cite: model outputs "[1], [3], [7]" for the chunks it used.

Cost vs. quality: Early stages are cheap (vector DB + BM25 lookups <10ms), reranking is expensive (~100–400ms), and generation dominates overall latency. The trick: spend early-stage compute to narrow aggressively so generation runs on a tightly focused context.

◐ Interactive: RAG pipeline, stage by stage

Query: “How much did contextual retrieval cut failed retrievals?”

Stages: 1. Query · 2. Dense retrieval · 3. BM25 (sparse) · 4. RRF fusion · 5. Rerank (cross-encoder) · 6. Generate

C1: Contextual Retrieval prepends chunk-specific context before embedding and BM25 indexing.
C2: It cut failed retrievals by ~49% (contextual embeddings + BM25); ~67% with reranking added.
C3: BM25 is a sparse lexical method matching exact terms and rare tokens like codes/IDs.
C4: Dense embeddings capture semantic similarity but can miss exact keyword matches.
C5: Reciprocal Rank Fusion merges ranked lists with score 1/(k+rank), k≈60.
C6: GraphRAG builds a knowledge graph + community summaries for multi-hop questions.

→ Stage 1, the user's question: note it asks 'how much', a number.

Step through the interactive pipeline above — watch a single query flow from rewrite to retrieval to fusion to rerank to generation, and notice how the candidate set shrinks and reorders at each stage. The thing to internalize: each stage trades cost/latency for recall or precision, and you tune the pipeline as a budget, not a checklist.

3.3 The naive → advanced → agentic progression

This is the spine of the whole pillar, and a favorite interview arc.

Naive RAG = chunk → embed → cosine → stuff the top-k into the prompt. It is the right thing to build first to get a baseline, and it is dead on arrival for most production workloads. Why it fails:

Dense-only retrieval misses exact tokens. A bi-encoder embeds meaning, so it's strong on paraphrase but weak on the literal: error codes (ERR_2043), SKUs, function names, person names, acronyms. Ask for get_user_by_id and a dense retriever may hand you fetch_account_details because they're semantically close. BM25 — a sparse, term-frequency ranker — nails exact and rare tokens but misses paraphrase. Neither alone is enough; hybrid wins (see /rag/hybrid-search-and-rerank).

✎ Dense vs. Sparse — on real queries

Dense (bi-encoder embeddings):

Query "how do I get the current user ID?" → dense vector ≈ [0.2, 0.8, -0.1, ...].
Chunk "Call fetch_account_details(acct_id) to retrieve the user object." → dense vector ≈ [0.19, 0.82, -0.09, ...] (semantically similar!).
Cosine similarity ≈ 0.98 → Dense ranks this HIGH even though it doesn't have the exact token get_user_by_id.

Sparse (BM25):

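Same query tokenized (lowercased, split on punctuation and underscores): ["how", "do", "i", "get", "the", "current", "user", "id"].
Same chunk tokenized: ["call", "fetch", "account", "details", "acct", "id", "to", "retrieve", "the", "user", "object"].
Token overlap: only ["id", "the", "user"], each appearing once; the query term get never appears.
BM25 score: LOW, because the few matching terms are common ones and the distinctive query term get is missing.
But a different chunk: "Use get_user_by_id(user_token) to fetch the active user." contains ["get", "user", "id"], with user appearing three times → BM25 ranks this HIGH.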

Hybrid (Dense + BM25 fused):

First chunk gets dense boost, second chunk gets sparse boost.
RRF merges them: second chunk climbs to the top because it nails exact tokens.
Result: Retrieve both semantic paraphrases AND exact-match facts.
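The sparse half of this comparison is easy to check by hand. The sketch below is a from-scratch BM25 scorer, not the `rank_bm25` library used later; it assumes a tokenizer that splits identifiers on underscores and punctuation, the common non-negative idf form log(1 + (N - n + 0.5) / (n + 0.5)), and the usual defaults k1 = 1.5 and b = 0.75. The names `tokenize` and `bm25_scores` are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase, then keep runs of letters/digits, so get_user_by_id
    # becomes ["get", "user", "by", "id"].
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    toks = [tokenize(d) for d in docs]
    n = len(toks)
    avgdl = sum(len(t) for t in toks) / n
    df = Counter(term for t in toks for term in set(t))  # document frequency
    query_terms = set(tokenize(query))
    scores = []
    for t in toks:
        tf = Counter(t)
        norm = k1 * (1 - b + b * len(t) / avgdl)  # length normalization
        score = 0.0
        for term in query_terms & tf.keys():
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


chunk_a = "Call fetch_account_details(acct_id) to retrieve the user object."
chunk_b = "Use get_user_by_id(user_token) to fetch the active user."
scores = bm25_scores("how do I get the current user ID?", [chunk_a, chunk_b])
# chunk_b outscores chunk_a: it matches the extra query term "get" and
# repeats "user", while chunk_a only matches terms both chunks share.
```

A dense retriever would need a real embedding model to reproduce the other half, so that side is left as described in the text above.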

No reranking. Top-k by cosine is a coarse filter. The genuinely best passage is often at rank 12, not rank 1.
Chunk boundaries destroy context. A naive fixed-size split orphans the sentence that resolves a pronoun, or severs a table from its caption.
Lost in the middle. Even when you retrieve well, stuffing 20 chunks degrades quality — models attend most to the start and end of long contexts and "lose" the middle (Liu et al., 2023). More context is not more better.

Advanced RAG fixes these with three families of technique: hybrid retrieval (dense + sparse fused with RRF), a reranking stage (cross-encoder over a top-50/100 candidate set), and query transforms (rewriting, decomposition, HyDE, multi-query). Anthropic's Contextual Retrieval (Sept 2024) is a clean exemplar: use an LLM to prepend a short, chunk-specific blurb situating each chunk in its parent document before embedding and BM25 indexing. Anthropic reported this cut failed retrievals by ~49% (contextual embeddings + contextual BM25), and ~67% once reranking was added. Prompt caching makes the per-chunk LLM annotation cheap because the full document sits in the cached prefix.
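The mechanical part of contextual retrieval, prepending a situating blurb to each chunk before it is embedded and BM25-indexed, can be sketched as below. The `situate` callable stands in for the LLM call that writes the blurb; `stub_situate` is a hypothetical placeholder that only echoes the document's first line so the example runs offline. The toy document is invented for illustration.

```python
from typing import Callable


def contextualize(document: str, chunks: list[str],
                  situate: Callable[[str, str], str]) -> list[str]:
    """Prepend a short, document-aware blurb to each chunk before indexing."""
    return [f"{situate(document, chunk)}\n\n{chunk}" for chunk in chunks]


def stub_situate(document: str, chunk: str) -> str:
    # Hypothetical stand-in for the LLM call: reuse the document's first line.
    title = document.strip().splitlines()[0]
    return f"From '{title}':"


document = "ACME Q2 report\nRevenue grew 3% over the previous quarter."
chunks = ["Revenue grew 3% over the previous quarter."]
indexed_texts = contextualize(document, chunks, stub_situate)
# indexed_texts[0] now carries the company and period that the bare chunk
# lacked; this enriched text is what gets embedded and BM25-indexed.
```

In a real pipeline the blurb would come from a model that sees the whole document, and the original chunk text is typically still what gets shown to the user and cited.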

Agentic RAG removes the assumption that you retrieve exactly once. An agent decides whether to retrieve at all, what to retrieve, reformulates when the first pass is thin, routes among multiple sources, retrieves iteratively, and self-checks groundedness before answering. This ties directly into the /agents pillar and tool use / MCP: retrieval becomes a tool the model calls, not a fixed preprocessing step. It's more capable and more expensive — multiple model round-trips per answer — so you reserve it for multi-hop and open-ended questions. The progression is a cost/quality ladder; you climb it only as far as your eval and your latency budget demand.
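The control flow that distinguishes agentic RAG from a single-shot pipeline is a retrieve, check, reformulate loop. The sketch below shows only that loop: every callable passed in (`retrieve`, `is_sufficient`, `rewrite`, `generate`) is a hypothetical stub over a two-fact toy corpus, standing in for what would be model calls and real retrievers.

```python
from typing import Callable

FACTS = {
    "billing": "Billing is owned by the Payments team.",
    "payments": "The Payments team is led by Dana.",
}


def agentic_answer(question: str,
                   retrieve: Callable[[str], list[str]],
                   is_sufficient: Callable[[str, list[str]], bool],
                   rewrite: Callable[[str, list[str]], str],
                   generate: Callable[[str, list[str]], str],
                   max_rounds: int = 3) -> tuple[str, int]:
    query, evidence = question, []
    for round_no in range(1, max_rounds + 1):
        for chunk in retrieve(query):
            if chunk not in evidence:
                evidence.append(chunk)
        if is_sufficient(question, evidence):
            return generate(question, evidence), round_no
        query = rewrite(question, evidence)  # reformulate and try again
    return "I could not find enough supporting context.", max_rounds


# Stubs: keyword lookup, a fixed sufficiency rule, and a canned rewrite.
def retrieve(query: str) -> list[str]:
    words = query.lower().split()
    return [fact for key, fact in FACTS.items() if key in words]


def is_sufficient(question: str, evidence: list[str]) -> bool:
    return any("led by" in e for e in evidence)


def rewrite(question: str, evidence: list[str]) -> str:
    return "payments team lead" if any("Payments" in e for e in evidence) else question


def generate(question: str, evidence: list[str]) -> str:
    return " ".join(evidence)


answer, rounds = agentic_answer("who leads the team that owns billing",
                                retrieve, is_sufficient, rewrite, generate)
```

The `max_rounds` cap is the cost control: each extra round is another retrieval plus at least one more model call, which is why the loop is reserved for multi-hop questions like this one.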

3.4 Where the bottleneck actually lives

Internalize this and you'll debug RAG correctly for the rest of your career: retrieval is the failure, generation is the symptom. When a RAG system "hallucinates," nine times out of ten the right chunk never made it into the context, so the model fell back on parametric recall. The fix is upstream (better chunking, hybrid search, reranking, query rewriting), not a sterner prompt. This is why /rag/evaluation insists on measuring retrieval and generation separately: recall@k / nDCG / MRR for the retriever, faithfulness / answer relevancy for the generator. Context recall caps grounded accuracy: if the retriever surfaces the needed passage for only 60% of queries, at most 60% of answers can be grounded in retrieved context, no matter how good the model is.
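The retriever-side metrics named above are small enough to write out. This is a sketch with binary relevance labels (a chunk is either relevant or not); the function names and the toy ranking are made up for the example, and MRR is simply the mean of `reciprocal_rank` over a set of queries.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain relative to an ideal ordering."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, chunk_id in enumerate(retrieved[:k], start=1)
              if chunk_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


retrieved = ["c7", "c2", "c9", "c4", "c1"]  # retriever output, best first
relevant = {"c2", "c4"}                     # golden-set labels for this query

r3 = recall_at_k(retrieved, relevant, 3)    # 0.5: only c2 is in the top 3
r5 = recall_at_k(retrieved, relevant, 5)    # 1.0: both are in the top 5
rr = reciprocal_rank(retrieved, relevant)   # 0.5: first hit is at rank 2
ndcg = ndcg_at_k(retrieved, relevant, 5)
```

Recall@k tells you whether the right chunk was retrieved at all; reciprocal rank and nDCG tell you how far down it sat, which is what a reranker is supposed to improve.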

3.5 RAG vs fine-tuning vs long-context

These are not competitors; they are complementary tools that compose. The staff-level answer always starts by refusing the false dichotomy.

RAG is for knowledge — fresh, large, private, attributable. It's the only option when the corpus changes faster than you can train and when access control must be enforced per query.
Fine-tuning (see /finetuning) is for behavior — format, tone, domain skill, tool-calling reliability, structured-output adherence. Fine-tune to teach the model how to act, not what to know. Trying to fine-tune in volatile facts is an anti-pattern: it's expensive, stale the moment training ends, and unattributable.
Long-context (1M-token windows on claude-opus-4-8 and claude-sonnet-4-6; see /context-engineering) is for whole-document reasoning — when the relevant material fits and cross-references span the whole doc. But it carries real costs: you pay for every token every call, latency scales with input, and "lost in the middle" still bites. A 1M-token window is not a license to skip retrieval; it's a license to retrieve coarser (fewer, larger chunks) and let the model do more in-context reasoning.

The composition in practice: RAG narrows millions of documents to the few thousand tokens that matter, long-context lets each retrieved unit be a large parent section instead of a tiny snippet, and a light fine-tune locks in citation format and refusal behavior. They stack.

◇ Live illustration: The RAG pipeline, end to end

Documents are chunked, embedded and indexed offline; at query time the system retrieves, fuses, reranks, and grounds the answer. Watch the data flow.

4. Minimal implementation

A tiny but real end-to-end RAG: BM25 sparse retrieval over a small corpus, then a Claude generation step constrained to the retrieved context. This is deliberately sparse-only and in-memory so it runs in seconds — the full version (dense + hybrid + rerank + eval) lives at examples/rag/mini_rag.py.

"""Tiny end-to-end RAG: BM25 retrieval + Claude generation, grounded and cited.
    pip install rank_bm25 anthropic   # ANTHROPIC_API_KEY in env
Full hybrid+rerank+eval version: examples/rag/mini_rag.py
"""
import anthropic
from rank_bm25 import BM25Okapi
 
# In production this CORPUS is the output of an ingest+chunk pipeline,
# each item carrying source/title/ACL metadata. Here: five flat chunks.
CORPUS = [
    "Anthropic's Contextual Retrieval prepends chunk-specific context before "
    "embedding and BM25 indexing, cutting failed retrievals by ~49%.",
    "Reciprocal Rank Fusion merges ranked lists: score = sum of 1/(k+rank), k about 60.",
    "BM25 is a sparse lexical ranker, strong on exact terms, codes, and rare tokens.",
    "Dense bi-encoder embeddings capture paraphrase but miss exact-token matches.",
    "Cross-encoder rerankers score query and document jointly, lifting relevance 10-30%.",
]
 
# Index. BM25 wants tokenized docs; lowercase + whitespace split is the honest minimum.
tokenized = [doc.lower().split() for doc in CORPUS]
bm25 = BM25Okapi(tokenized)
 
 
def retrieve(query: str, k: int = 3) -> list[str]:
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(scores, CORPUS), key=lambda x: x[0], reverse=True)
    return [doc for score, doc in ranked[:k] if score > 0]  # drop zero-score noise
 
 
client = anthropic.Anthropic()
 
 
def answer(query: str) -> str:
    chunks = retrieve(query)
    if not chunks:
        return "No relevant context found."  # retrieval failed — don't guess
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    prompt = (
        "Answer ONLY from the numbered context below. "
        "If the answer is not present, say so explicitly. "
        "Cite the chunk number(s) you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    resp = client.messages.create(
        model="claude-sonnet-4-6",  # high-volume RAG answer gen; opus-4-8 for harder synthesis
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return next(b.text for b in resp.content if b.type == "text")
 
 
if __name__ == "__main__":
    print(answer("What k does RRF use, and why does hybrid beat dense alone?"))

What's worth noticing, because it's load-bearing in real systems:

✎ Minimal RAG code — section by section

Setup and corpus (the imports and CORPUS): Import the Anthropic client and the BM25 ranker; define a tiny in-memory corpus of 5 fact chunks. In production, this comes from an ingest+chunk pipeline and lives in a vector DB; here it's flat strings for speed.

Indexing (tokenized and bm25): Tokenize each document (lowercase + whitespace split) and feed it to BM25Okapi, which precomputes term and document frequencies. This is the one-time offline step; it doesn't run per query.

Retrieval function (retrieve): retrieve() takes the user query, tokenizes it the same way, asks BM25 for scores, sorts by score descending, and returns the top-k chunks with non-zero scores. The if score > 0 guard prevents garbage matches.

Grounding + citation (answer): answer() calls retrieve(), formats chunks with numbers [1], [2], ..., and sends a prompt that instructs the model not to use outside knowledge. The instruction "cite the chunk numbers you used" makes the model show its work. The next(b.text ...) expression pulls the text out of the first text block in the response.

The whole flow: User asks → retrieve top-3 chunks → stamp them [1][2][3] → send "answer ONLY from these" + question → model grounds answer in chunks and cites them. What it achieves: a tiny but real end-to-end RAG that curbs hallucination by steering the model to answer from the supplied chunks instead of memorized facts.

The numbered context + "cite the chunk" instruction is the cheapest path to attribution and the foundation of measuring faithfulness later.
The explicit "if not present, say so" plus the if not chunks guard is your first defense against the model falling back on parametric recall when retrieval whiffs.
Model choice is a deliberate cost lever. claude-sonnet-4-6 is the workhorse for high-volume RAG answer generation; escalate to claude-opus-4-8 for multi-hop synthesis, or drop to claude-haiku-4-5 for simple extractive lookups. The retrieval layer doesn't change — only the generator.
This is sparse-only. The very next step toward production is adding a dense retriever (e.g. sentence-transformers / BGE-M3) and fusing with RRF — exactly what /rag/hybrid-search-and-rerank covers.

5. Production tradeoffs

Stage	Cost	Latency	Quality lever	Primary failure mode
Chunking	One-time compute	Offline	Boundary strategy (recursive, semantic, parent-child)	Orphaned context; severed tables/code
Embedding	$/1M tokens at index time	Offline	Model (text-embedding-3-large, Cohere v4, BGE-M3); dims via Matryoshka	Domain mismatch; stale index after model swap
Vector index	RAM/disk for vectors	<10ms (HNSW)	HNSW vs IVF-PQ; recall/latency/memory knobs	Low recall from aggressive ANN params
Sparse (BM25)	Cheap	<10ms	Tokenization, field boosting	Misses paraphrase entirely
Fusion (RRF)	Negligible	~0	`k` constant (≈60)	None significant — robust by design
Rerank	Per-query model call	+100–400ms	Cross-encoder over top-50/100	Cost/latency at high QPS
Generation	$/1M in+out tokens	+500ms–several s	Model tier; context size	Lost-in-the-middle; ungrounded answers

What changes at scale. At small scale, naive RAG in pgvector is fine and cheap. As the corpus and QPS grow, three things bite. First, index economics: at tens of millions of vectors, full-precision HNSW gets expensive in RAM, so you move to int8/binary quantization (Cohere embed v4 can return int8 and binary embeddings directly; vectors from open models such as BGE-M3 can be quantized after the fact) and/or IVF-PQ, trading a few points of recall for an order of magnitude less memory, and you pick a vector DB built for it (Qdrant, Milvus, Turbopuffer, LanceDB). Second, reranking cost: a cross-encoder over 100 candidates per query is fine at 1 QPS and ruinous at 1000 QPS, so you either cache, narrow the candidate set, or move to a late-interaction model (ColBERT-style, near cross-encoder quality at much lower latency). Third, freshness and access control: ingest pipelines need incremental reindexing and per-document ACLs enforced as metadata filters at retrieval time, not as a post-hoc check.
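The index-economics point is plain arithmetic, sketched below for raw vector storage only. The corpus size and embedding width are hypothetical round numbers, and the figures exclude HNSW graph links, metadata, and the stored chunk text, all of which add to the real footprint.

```python
def index_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw storage for the vectors alone, in bytes."""
    return num_vectors * dims * bits_per_dim // 8


N, D = 10_000_000, 1024  # hypothetical: 10M chunks, 1024-dimensional embeddings

for name, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    print(f"{name:>8}: {index_bytes(N, D, bits) / 1e9:.2f} GB")
# prints:
#  float32: 40.96 GB
#     int8: 10.24 GB
#   binary: 1.28 GB
```

Under these assumptions int8 is a 4x saving and binary a 32x saving, which is where the "order of magnitude" comes from; the recall cost has to be measured on your own eval set.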

The dominant failure modes, ranked: (1) retrieval recall — the right chunk isn't in the index or isn't retrieved; (2) chunking — the right info exists but was split badly; (3) ranking — the right chunk is retrieved but buried below junk the reranker should have caught; (4) generation — only after the first three are good does the model itself become the limiting factor. Debug in that order.
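That debugging order can be turned into a small triage helper for failing queries. The sketch assumes you have a golden answer span per query plus three things logged per request: the candidate ids the retriever returned, the ids that survived into the prompt, and a faithfulness verdict. The function name, the label strings, and the toy index are invented for the example.

```python
def localize_failure(gold_span: str, index: dict[str, str],
                     candidates: list[str], context: list[str],
                     answer_supported: bool) -> str:
    """Return the first pipeline stage that lost the gold passage."""
    # Chunks that contain the gold span in one piece.
    holders = {cid for cid, text in index.items() if gold_span in text}
    if not holders & set(candidates):
        # Nothing useful was retrieved. If no chunk holds the span at all,
        # suspect ingest or a bad split; otherwise it is a recall miss.
        return "chunking" if not holders else "retrieval"
    if not holders & set(context):
        return "ranking"      # retrieved, but cut before the prompt
    if not answer_supported:
        return "generation"   # right context was present, answer still drifted
    return "ok"


INDEX = {
    "c1": "Refunds are issued within 14 days.",
    "c2": "Shipping takes 3 days.",
}
```

A usage sketch: `localize_failure("within 14 days", INDEX, ["c2", "c1"], ["c2"], False)` reports `"ranking"`, because chunk c1 was retrieved but never reached the prompt. Tallying these labels over a batch of failing queries shows which stage deserves the next week of work.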

6. How it's asked

[IC4] "Sketch a production RAG pipeline and tell me where it breaks." Draw both halves: offline ingest → chunk → embed → index, online query → rewrite → retrieve(dense+sparse) → fuse → rerank → assemble → generate → cite. Then say the quiet part: it breaks at retrieval far more than at generation. Name the top culprits — dense-only missing exact tokens (fix with hybrid + BM25), bad chunk boundaries, and no reranking — and note you'd build a golden eval set first so you can localize regressions. Naming real tools (pgvector/Qdrant, RRF, a cross-encoder reranker, Claude for generation) signals you've actually shipped one.

[IC4] "Why isn't cosine similarity over chunks enough?" Because a bi-encoder embeds meaning, so it's strong on paraphrase and weak on the literal — codes, IDs, names, function signatures — which is exactly what a lot of real queries hinge on. BM25 covers that gap; fusing the two with RRF gets you both. Cosine top-k is also a coarse ranker — the best passage is often not rank 1 — so you add a cross-encoder reranker over a larger candidate set. That's the naive→advanced jump.

[IC5] "A customer says RAG hallucinates on their docs. Diagnose it." First, separate retrieval from generation. I'd take the failing queries, check whether the gold passage is even in the retrieved set — that's context recall. If recall is low, the bug is upstream (chunking, dense-only retrieval, query phrasing), and no prompt change fixes it. If recall is fine but the answer still drifts, measure faithfulness — claims unsupported by context — and tighten the grounding prompt, add citation enforcement, or trim context to avoid lost-in-the-middle. The headline: most "hallucination" is retrieval failure in disguise, so I fix retrieval first. (See /rag/evaluation.)

[IC5] "Reranking costs latency and money. Justify it." A cross-encoder scores query and document jointly, so it sees interactions a bi-encoder can't, typically lifting relevance 10–30%. The cost is a per-query model call adding ~100–400ms. I make it pay by gating it: dense+sparse retrieve a top-50/100 cheaply, the reranker compresses that to the top 5–8 that actually enter the prompt. That smaller, higher-precision context also lowers generation cost and dodges lost-in-the-middle, so reranking often pays for itself downstream. At high QPS I'd consider a late-interaction model (ColBERT/PLAID) for most of the cross-encoder quality at a fraction of the latency.

[IC6] "1M-token context exists. Do we still need retrieval?" For a single bounded document that fits, often no — long-context wins on whole-doc reasoning and is simpler. But retrieval still wins whenever the corpus exceeds the window (millions of docs), when you need per-query access control (you can't dump another tenant's data into the prompt), when attribution to specific sources is required, and on unit economics — paying for 1M tokens every call is brutal at scale versus retrieving a few thousand relevant ones. And lost-in-the-middle means a stuffed window underperforms a focused one. The mature stance: they compose — retrieval narrows the field, long-context lets each retrieved unit be a large parent section, and you tune the split by cost and latency budget. (See /context-engineering.)

7. Pitfalls & flashcards

Treating RAG as a generation problem. You'll waste weeks tuning prompts when the right chunk never reached the model. Instrument retrieval recall first.
Dense-only retrieval. Ships fine in the demo, fails on the first query with an error code or product SKU. Add BM25 and fuse.
Skipping reranking because "cosine top-k looks fine." It looks fine until you measure nDCG. The best passage is rarely rank 1.
Over-stuffing context. More chunks ≠ better answers; lost-in-the-middle and cost both punish you. Retrieve broadly, rerank hard, pass few.
No golden eval set. Without labeled query→relevant-chunk pairs you cannot tell retrieval failure from generation failure, and you cannot regression-gate changes in CI.
Fine-tuning to inject facts. Wrong tool — facts go stale and lose attribution. Fine-tune behavior, retrieve knowledge.
Forgetting access control. ACLs must be retrieval-time metadata filters, not a hope. A retrieval leak is a data breach.
Re-embedding asymmetry. Swapping the embedding model means reindexing the entire corpus; query and document vectors must come from the same model.
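To make the access-control pitfall concrete, here is a minimal sketch of filtering on ACL metadata before scoring, so that text a user may not see never becomes a candidate. The `Doc` dataclass, the group names, and the word-overlap scorer are all assumptions for the example; a real system would push the same filter into the vector DB query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # ACL stored as metadata at ingest time


def retrieve_with_acl(query: str, docs: list[Doc],
                      user_groups: set[str], k: int = 3) -> list[Doc]:
    # Filter BEFORE scoring so forbidden text never becomes a candidate.
    visible = [d for d in docs if d.allowed_groups & user_groups]
    terms = set(query.lower().split())
    # Toy relevance score: number of query words present in the document.
    scored = [(len(terms & set(d.text.lower().split())), d) for d in visible]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [d for score, d in ranked[:k] if score > 0]


DOCS = [
    Doc("pricing", "enterprise pricing tiers and discounts", frozenset({"sales", "finance"})),
    Doc("payroll", "payroll and salary bands by level", frozenset({"finance"})),
]
sales_hits = retrieve_with_acl("salary bands", DOCS, {"sales"})
finance_hits = retrieve_with_acl("salary bands", DOCS, {"finance"})
# sales_hits is empty; finance_hits contains only the payroll document.
```

Filtering after retrieval instead would still leak through side channels such as result counts and scores, and a bug in the post-hoc check would put another tenant's text straight into the prompt.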

Flashcard. RAG is a retrieval bet, not a generation bet. The pipeline is ingest→chunk→embed→index offline and query→rewrite→retrieve(dense+sparse)→fuse→rerank→assemble→generate→cite online. Naive RAG (cosine→stuff) is a baseline, not a product; advanced RAG adds hybrid + rerank + query transforms; agentic RAG lets the model decide whether/what/when to retrieve. Debug retrieval before generation. RAG (knowledge), fine-tuning (behavior), and long-context (whole-doc reasoning) compose — they don't compete.

8. Further reading

Lewis et al., 2020 — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — the original parametric + non-parametric framing.
Anthropic — Introducing Contextual Retrieval (Sept 2024) — contextual embeddings + contextual BM25 + reranking, with the ~49% / ~67% failed-retrieval reductions and the prompt-caching cost trick.
Liu et al., 2023 — Lost in the Middle — why stuffing more context degrades quality.
Microsoft — GraphRAG — entity/relationship extraction → community summaries for global, multi-hop "sense-making" questions a flat vector search can't answer.
RAGAS — Automated Evaluation of RAG — faithfulness, answer relevancy, context precision/recall as LLM-as-judge metrics.

Next: Chunking & Embeddings → the offline half of the pipeline, where most retrieval quality is won or lost.
