The techniques that turn a demo RAG into a production one — contextual retrieval, query transforms, GraphRAG, and agentic retrieval — and the cost/latency/quality math that tells you which to reach for.
Naive RAG is chunk → embed → cosine → stuff. Advanced retrieval is everything you bolt on once you measure recall on real queries and discover the embedding model is silently dropping a third of the relevant chunks. The mental model: retrieval is the bottleneck, not generation — a frontier model can only reason over what you put in its context, so the highest-leverage engineering is upstream of the LLM. This lesson covers the four moves that consistently move the needle in 2026: Contextual Retrieval (use an LLM to situate each chunk before indexing), query transforms (rewrite/expand/decompose the question), GraphRAG (build a knowledge graph for global and multi-hop questions a flat index can't answer), and agentic RAG (let the model decide whether, what, and when to retrieve). The staff-level skill is not knowing these exist — it's knowing the cost/latency/quality tradeoff of each well enough to not deploy GraphRAG when hybrid + rerank would have done.
This is the topic that separates "I followed a LangChain tutorial" from "I have run a retrieval system that real users depend on." The interviewer wants to see whether you reach for complexity reflexively or diagnostically.
At IC4, the tells are:
At IC5, additionally:
At IC6, additionally:
The words first.
Step by step.
1. An LLM reads the whole document and writes one short context sentence ("This is from Acme's 2024 Q3 report...") that is prepended to each chunk before indexing.
2. An LLM scans the docs and extracts entities and how they relate, building a graph (nodes = things, edges = relationships).
3. For a multi-hop or "summarize the whole corpus" question, the system walks that graph and gathers connected facts no single chunk holds.

Remember this: contextual retrieval makes each chunk findable by adding its missing context, while GraphRAG links facts into a graph so you can answer broad, multi-step questions.
A bi-encoder maps query and document independently into a vector space and scores by cosine/dot product. That single vector is a lossy summary. Two failure modes follow directly:
ORA-01555?" Here the string `ORA-01555` carries almost all the information, but a 1024-dim float vector smears it into a neighborhood of "database error" chunks. Sparse retrieval (BM25) nails exact tokens, rare terms, codes, and IDs because it scores on term frequency, not semantics.

Hybrid search (covered in depth here) fixes (1). Contextual Retrieval fixes (2). They compose.
The insight is almost embarrassingly simple: before you embed or BM25-index a chunk, prepend a short, chunk-specific blurb that situates it in its parent document — generated by an LLM that has read the whole document. The chunk above becomes:
This chunk is from the Limitation of Liability section (§9.2) of the Acme–Globex Master Services Agreement dated 2024. The cap shall not exceed 18 months of fees.
Now "Acme MSA liability cap" retrieves it on both the dense and sparse paths. Anthropic's published results: contextual embeddings + contextual BM25 reduced failed retrievals (top-20) by ~49%, and ~67% when combined with reranking. Those are among the most reliable public numbers in RAG — attribute them in an interview, don't round them up.
The architecture has three independent levers, and you want all three:
1. Contextual embeddings: embed `context + chunk`.
2. Contextual BM25: index `context + chunk` in the sparse index too.
3. Fusion plus rerank: fuse the dense and sparse rankings with Reciprocal Rank Fusion (`score = Σ 1/(k + rank)`, k≈60, from Cormack 2009; robust because it never touches the raw, uncalibrated scores), then run a cross-encoder reranker (Cohere Rerank 3, bge-reranker-v2, mxbai-rerank) to cut to top-5.

RRF combines two ranked lists (dense and sparse) without touching their raw scores, which are on different scales and can't be directly added. Instead, each list contributes 1 / (k + rank) for a document sitting at position rank (0, 1, 2, ...), and the contributions are summed across lists. The constant k (typically k = 60, from Cormack's research) damps the advantage of the very top ranks. For example, a document at rank 2 in the dense list and rank 5 in the sparse list, with k = 60, gets 1 / (60 + 2) ≈ 0.0161 from dense plus 1 / (60 + 5) ≈ 0.0154 from sparse, totaling ≈ 0.0315. A document at rank 0 (first place) in both lists scores 1 / 60 ≈ 0.0167 from each, for ≈ 0.0333. Ranks here are 0-indexed, so rank 0 is position 1 in human terms (the original paper counts from 1, which barely changes the result at k = 60). Documents that appear high in both lists rise to the top of the fused result.

Why it works: you never compare raw cosine scores with BM25 scores (they have different ranges and meanings), only ordinal positions, so calibration doesn't matter. A first-place hit scored 0.999 and a first-place hit scored 0.501 cast exactly the same vote; that robustness is the goal.
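To make the arithmetic concrete, here is a minimal sketch that recomputes the worked example with 0-indexed ranks and k = 60. The helper name `rrf_contribution` is made up for illustration.

```python
def rrf_contribution(rank: int, k: int = 60) -> float:
    """Score that one ranked list contributes for a document at a 0-indexed rank."""
    return 1.0 / (k + rank)


# Document at rank 2 in the dense list and rank 5 in the sparse list.
mixed = rrf_contribution(2) + rrf_contribution(5)

# Document at rank 0 (first place) in both lists.
top_both = rrf_contribution(0) + rrf_contribution(0)

print(f"{mixed:.4f} {top_both:.4f}")  # prints "0.0315 0.0333"
```

The gap between the two totals is small in absolute terms, which is expected: with k = 60 the curve is deliberately flat, so agreement across lists matters more than a single first-place finish.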
Why the per-chunk LLM call is affordable: prompt caching. Naively, generating context for each of a document's N chunks means feeding the whole document to the LLM N times — quadratic in document size, and ruinous. Prompt caching collapses this. You cache the document once (paying a ~1.25× write premium for the 5-minute TTL), then each per-chunk call re-reads the cached document at ~0.1× the input price and only pays full price for the tiny chunk + instruction + ~50-token output. Anthropic reported a one-time cost of roughly $1.02 per million document tokens to generate contextual chunks this way. The mechanism matters: prompt caching is a prefix match, so the document must sit at the front of the prompt (the stable prefix) and the per-chunk instruction at the end (the volatile suffix). Put a timestamp or chunk index ahead of the document and you invalidate the cache on every call — your bill 10×'s silently and cache_read_input_tokens reads zero.
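A rough sketch of that cost math, using only the ~1.25× write and ~0.1× read multipliers quoted above. It counts document tokens in "full-price token equivalents", ignores the small chunk, instruction, and output tokens, and assumes every call lands inside the cache TTL. The function names and the example sizes are invented for illustration.

```python
def naive_input_tokens(doc_tokens: int, n_chunks: int) -> float:
    """Without caching, every per-chunk call re-sends the whole document at full price."""
    return float(doc_tokens * n_chunks)


def cached_input_tokens(
    doc_tokens: int,
    n_chunks: int,
    write_mult: float = 1.25,  # assumed cache-write premium
    read_mult: float = 0.1,    # assumed cache-read discount
) -> float:
    """With caching, the first call writes the cache and the remaining calls read it."""
    if n_chunks == 0:
        return 0.0
    return doc_tokens * (write_mult + read_mult * (n_chunks - 1))


doc_tokens, n_chunks = 50_000, 100  # hypothetical document
naive = naive_input_tokens(doc_tokens, n_chunks)
cached = cached_input_tokens(doc_tokens, n_chunks)
print(f"naive={naive:,.0f} cached={cached:,.0f} ratio={naive / cached:.1f}x")
```

As the chunk count grows, the ratio approaches 1 / 0.1 = 10×, which is the same factor you lose when a volatile prefix silently defeats the cache.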
The query the user typed is rarely the optimal query for the index. Transforms reshape it:
The honest tradeoff: every transform adds at least one LLM round-trip before retrieval. On a latency budget of 300 ms that's often unacceptable; on a research assistant tolerating 10 s it's free money. Measure recall lift per added millisecond.
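As one illustration of a transform, the sketch below does query expansion: it generates variants of the user's query, retrieves for each, and fuses the rankings with RRF. The `expand` and `search` callables are hypothetical stand-ins (in practice `expand` would be an LLM call and `search` your existing retriever), and the toy corpus is invented.

```python
from typing import Callable


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over lists of doc ids (0-indexed ranks)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


def multi_query_retrieve(
    query: str,
    expand: Callable[[str], list[str]],  # hypothetical: LLM-generated rewrites
    search: Callable[[str], list[str]],  # hypothetical: existing retriever
) -> list[str]:
    variants = [query] + expand(query)
    return rrf([search(v) for v in variants])


# Toy stand-ins so the sketch runs without any model or index.
CORPUS = {
    "d1": "resetting a forgotten password",
    "d2": "account lockout after failed logins",
    "d3": "billing cycle and invoices",
}


def toy_search(q: str) -> list[str]:
    terms = set(q.lower().split())
    hits = [(len(terms & set(text.split())), doc_id) for doc_id, text in CORPUS.items()]
    return [doc_id for overlap, doc_id in sorted(hits, key=lambda h: -h[0]) if overlap > 0]


def toy_expand(q: str) -> list[str]:
    return ["forgotten password reset", "account lockout"]


fused = multi_query_retrieve("can't log in", toy_expand, toy_search)
print(fused)
```

The raw query shares no terms with the corpus and retrieves nothing on its own; the expanded variants surface the two relevant documents. The price is the extra round-trip inside `expand`, which is exactly the latency the tradeoff above asks you to measure.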
Flat vector search answers local questions: "What is X?" It cannot answer global, sense-making questions: "What are the major themes across these 10,000 incident reports?" No single chunk contains the answer; the answer is a property of the whole corpus.
GraphRAG addresses this by, at ingestion time, using an LLM to extract entities and relationships from every chunk into a knowledge graph, running community detection (the Leiden algorithm) to cluster the graph hierarchically, and then generating community summaries at each level. A global query is answered by map-reducing over relevant community summaries rather than over raw chunks. This genuinely unlocks multi-hop ("which engineers touched services that depend on the auth module?") and thematic queries that flat RAG simply gets wrong.
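To see why a graph answers the multi-hop question where a flat index cannot, here is a toy sketch that stores extracted relationships as triples and walks two hops. It deliberately skips the LLM extraction, Leiden clustering, and community summaries; the triples and all names are invented for illustration.

```python
from collections import defaultdict

# (subject, relation, object) triples, as an LLM extractor might emit them.
TRIPLES = [
    ("alice", "touched", "billing-service"),
    ("bob", "touched", "search-service"),
    ("carol", "touched", "billing-service"),
    ("billing-service", "depends_on", "auth-module"),
    ("search-service", "depends_on", "index-store"),
]


def build_index(triples: list[tuple[str, str, str]]) -> dict[tuple[str, str], set[str]]:
    """Index subjects by (relation, object) so edges can be walked backwards."""
    incoming: dict[tuple[str, str], set[str]] = defaultdict(set)
    for subj, rel, obj in triples:
        incoming[(rel, obj)].add(subj)
    return incoming


def engineers_touching_dependents(
    incoming: dict[tuple[str, str], set[str]], module: str
) -> set[str]:
    services = incoming.get(("depends_on", module), set())  # hop 1: who depends on it
    engineers: set[str] = set()
    for svc in services:
        engineers |= incoming.get(("touched", svc), set())  # hop 2: who touched those
    return engineers


incoming = build_index(TRIPLES)
print(sorted(engineers_touching_dependents(incoming, "auth-module")))  # ['alice', 'carol']
```

No single triple (or chunk) mentions both an engineer and the auth module, so the answer only exists once the two facts are joined through the shared service node.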
The cost is real and it's where IC5/IC6 candidates earn their stripes: graph construction is an LLM call per chunk (entity/relation extraction) plus summary generation per community — easily 5–20× the ingestion cost of a vector index, and it must be re-run or incrementally patched as the corpus changes. Reach for GraphRAG only when your eval set contains genuinely global/multi-hop questions that hybrid + rerank measurably fails. For the common case — "find the passage that answers this question" — it's expensive overkill. Lighter variants exist: LightRAG (dual-level retrieval, cheaper incremental updates) and agentic graph traversal, where the agent walks the graph on demand instead of pre-summarizing every community.
The pipelines above are static: a fixed sequence runs on every query. Agentic RAG hands control to the model. Retrieval becomes a tool the model calls (tying directly to /agents and /agents/tool-use-and-mcp — your search index is often exposed as an MCP server). The model now decides:
This is strictly more capable and strictly more dangerous. The capability: it handles multi-hop, ambiguous, and "I don't know what I'm looking for" queries that no static pipeline can. The danger: unbounded loops. An agent that keeps deciding "not enough context, retrieve again" can burn 15 LLM calls and 30 seconds on one query. In production you always impose a hard step budget (e.g. ≤4 retrieval rounds), a token budget, and a fallback to "answer with what you have + flag low confidence." With Claude, adaptive thinking lets the model decide reasoning depth per step, and a task_budget can give the loop a self-moderating token ceiling — but the harness-level step cap is non-negotiable.
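A harness-level step cap can be sketched independently of any model API. In the sketch below, `next_query` stands in for the model's decision (return a query to retrieve again, or `None` when it has enough context) and `search` stands in for the retrieval tool; both are hypothetical, and the token budget is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AgentResult:
    context: list[str] = field(default_factory=list)
    rounds: int = 0
    low_confidence: bool = False


def bounded_retrieval_loop(
    question: str,
    next_query: Callable[[str, list[str]], Optional[str]],  # hypothetical model decision
    search: Callable[[str], list[str]],                     # hypothetical retrieval tool
    max_rounds: int = 4,
) -> AgentResult:
    result = AgentResult()
    while result.rounds < max_rounds:
        query = next_query(question, result.context)
        if query is None:  # the model says it has enough context
            return result
        result.context.extend(search(query))
        result.rounds += 1
    # Step budget spent without the model signalling it was done:
    # answer with what we have and flag it.
    result.low_confidence = True
    return result


# A stub "model" that is never satisfied, to show the cap firing.
never_satisfied = lambda q, ctx: f"{q} (attempt {len(ctx) + 1})"
out = bounded_retrieval_loop(
    "why did the deploy fail?", never_satisfied, lambda q: [f"chunk for: {q}"]
)
print(out.rounds, out.low_confidence)  # prints "4 True"
```

The cap lives in the loop condition rather than in the prompt, so it holds even when the model keeps asking for more.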
The pragmatic 2026 stance: start static (hybrid + contextual + rerank), and promote to agentic only for the query classes your evals show static retrieval failing. Agentic RAG is the right default for open-ended research assistants and the wrong default for a latency-sensitive support widget.
The single highest-leverage technique here is Contextual Retrieval, and the load-bearing detail is the prompt-caching layout. Here's a real, runnable implementation using claude-haiku-4-5 (cheapest model, ample for this summarization task) to generate per-chunk context, with the parent document cached.
```python
# pip install anthropic rank_bm25 sentence-transformers
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

# The document goes FIRST (stable cached prefix); the chunk instruction goes
# LAST (volatile suffix). Reversing these two breaks the prefix cache.
DOC_BLOCK = "<document>\n{doc}\n</document>"
CHUNK_INSTRUCTION = (
    "Here is a chunk from the document above:\n"
    "<chunk>\n{chunk}\n</chunk>\n\n"
    "Give a short, succinct context (1-2 sentences) situating this chunk within "
    "the overall document, to improve search retrieval of the chunk. "
    "Answer ONLY with the context, nothing else."
)


def situate(doc: str, chunk: str):
    """Generate a retrieval-context blurb for one chunk.

    The full document is cached once (cache_control on the first block); every
    subsequent chunk in the same document re-reads it at ~0.1x input price.
    Returns (context, usage).
    """
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": DOC_BLOCK.format(doc=doc),
                    "cache_control": {"type": "ephemeral"},  # <-- cache the doc
                },
                {
                    "type": "text",
                    "text": CHUNK_INSTRUCTION.format(chunk=chunk),
                },
            ],
        }],
    )
    context = next(b.text for b in resp.content if b.type == "text")
    return context, resp.usage


def contextualize_document(doc: str, chunks: list[str]) -> list[str]:
    enriched = []
    for i, chunk in enumerate(chunks):
        context, usage = situate(doc, chunk)
        # Sanity-check caching on the 2nd+ call: cache_read should dominate.
        if i == 1:
            assert usage.cache_read_input_tokens > 0, "doc not cached; check prefix"
        enriched.append(f"{context}\n\n{chunk}")  # prepend, then index THIS string
    return enriched
```

The `situate` function takes the full document and one chunk, sends them to Claude, and gets back a context blurb. Breaking it down:

1. Message assembly: you build a list with two text blocks, first the document (wrapped in `<document>` tags and marked with `cache_control: {"type": "ephemeral"}`), second the chunk instruction. The document goes first so it becomes the stable cached prefix; the instruction goes second so each call varies only that tiny part.
2. API call: send to Claude Haiku (cheap, fast enough for summarization) with a small token budget (150 tokens for the context).
3. Extraction: pull the text from the first text block of the response, skipping any non-text content blocks.
4. Return: give back both the context string and the usage metrics so the caller can verify that caching worked (checking `cache_read_input_tokens > 0` on the second+ call).

The whole flow is wrapped in a loop in `contextualize_document`, which calls `situate` once per chunk and builds an enriched list of "context + chunk" strings ready for embedding and indexing.
You then index the *enriched* strings on **both** paths — dense and sparse — and fuse at query time:
```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")  # or OpenAI / Cohere / Voyage

# `doc` and `chunks` are assumed to be defined by your ingestion code.
enriched = contextualize_document(doc, chunks)
dense = embedder.encode(enriched, normalize_embeddings=True)  # contextual embeddings
bm25 = BM25Okapi([c.split() for c in enriched])                # contextual BM25


def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def retrieve(query: str, top_k: int = 50) -> list[int]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    dense_rank = np.argsort(-(dense @ q))[:top_k].tolist()
    sparse_rank = np.argsort(-bm25.get_scores(query.split()))[:top_k].tolist()
    fused = rrf([dense_rank, sparse_rank])
    return fused[:top_k]  # -> feed top-50 to a cross-encoder reranker, cut to top-5
```

Honest notes on this code:

1. For real ingestion of thousands of documents, run the `situate` calls through the Batches API at 50% off; context generation is not latency-sensitive.
2. The assert on `cache_read_input_tokens` is your single best defense against a silently-broken cache; keep something like it in your ingestion telemetry.
3. The reranker (e.g. FlagEmbedding's bge-reranker-v2 or cohere.rerank) is omitted for brevity but is where the gain jumps from ~49% to ~67%, so don't skip it.
4. Caching has a minimum prefix (~4096 tokens on Haiku 4.5); a tiny document won't cache and the per-chunk cost stays higher. That's fine, since the absolute cost is still small.
| Technique | Build/ingest cost | Query latency | Query cost | Quality lift | Main failure mode |
|---|---|---|---|---|---|
| Naive dense | Low | Low | Low | Baseline | Lexical blindness; context loss |
| Hybrid + rerank | Low | +rerank (~50–200 ms) | +rerank | Large on lexical/recall | Reranker latency at high QPS |
| Contextual Retrieval | Medium (LLM per chunk, ~$1/M tokens cached) | Same as hybrid | Same as hybrid | ~49% fewer failed retrievals; ~67% w/ rerank | Cache invalidation; stale context on doc edits |
| Query transforms | None | +1 LLM round-trip each | +LLM call(s) | High on hard/ambiguous queries | Latency; a bad HyDE drags retrieval |
| GraphRAG | High (5–20× ingest) | Medium–high (map-reduce) | High | Decisive on global/multi-hop | Cost; graph staleness; overkill on local Qs |
| Agentic RAG | Low | High, variable | High, variable | Decisive on open-ended | Unbounded loops; runaway cost/latency |
The prose that matters around the table:
What shifts at scale. Contextual Retrieval and GraphRAG move cost to ingestion (paid once, amortized over every query) and keep query-time cheap — ideal for high-QPS, stable corpora. Agentic RAG and query transforms move cost to query time — fine for low-QPS, high-value queries; lethal at scale. Pick where you want the cost to live.
Failure modes you must name. (1) Cache invalidation: Contextual Retrieval's cheapness assumes the document prefix is byte-stable, so document edits force regeneration. (2) Staleness: graphs and community summaries lag the corpus; you need an incremental update story or an SLA on rebuild cadence. (3) Reranker as bottleneck: cross-encoders are O(candidates) forward passes, so at 1000 QPS reranking top-100 is your dominant cost. ColBERT-style late interaction (per-token embeddings + MaxSim, PLAID index, via ragatouille) gets near-cross-encoder quality at far lower query latency, which makes it the staff move when rerank latency caps your throughput. (4) Agentic runaway: always bound steps and tokens.
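The MaxSim scoring behind late interaction fits in a few lines. This is a toy pure-Python sketch with hand-written 2-dimensional token vectors; real systems use learned per-token embeddings and a dedicated index, and the function names here are invented for illustration.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """For each query token, take its best-matching doc token, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]  # two query tokens
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first

print(maxsim(query, doc_a), maxsim(query, doc_b))  # prints "2.0 1.0"
```

Because document token vectors can be computed and indexed ahead of time, query-time work reduces to dot products and maxima instead of a full cross-encoder forward pass per candidate.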
Evaluate retrieval and generation separately. Build a golden set from real queries with labeled relevant chunks. Measure recall@k / nDCG / MRR on retrieval in isolation — if recall@20 is 0.6, no prompt engineering on the generator will save you, because 40% of the time the answer isn't even in context. Then measure faithfulness and answer relevancy (RAGAS, DeepEval, TruLens — LLM-as-judge) on generation. Gate both in CI so an embedding-model swap can't silently regress recall. (Full treatment in /rag/evaluation.)
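The retrieval-side metrics are simple enough to compute without a framework. Below is a minimal sketch of recall@k and MRR over a golden set; the golden-set layout, the chunk ids, and the function names are assumptions made for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant chunk (1-indexed), or 0 if none is retrieved."""
    for position, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


# Hypothetical golden set: query -> (retrieved ids in rank order, labeled relevant ids)
GOLDEN = {
    "q1": (["c7", "c2", "c9"], {"c2", "c4"}),
    "q2": (["c1", "c5", "c3"], {"c1"}),
}

mean_recall = sum(recall_at_k(r, rel, k=3) for r, rel in GOLDEN.values()) / len(GOLDEN)
mrr = sum(reciprocal_rank(r, rel) for r, rel in GOLDEN.values()) / len(GOLDEN)
print(f"recall@3={mean_recall:.2f} MRR={mrr:.2f}")  # prints "recall@3=0.75 MRR=0.75"
```

A CI gate can then be a plain threshold check on `mean_recall` against the last accepted baseline.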
task_budget ceiling, and a graceful fallback that answers with current context and flags low confidence rather than looping forever. I'd default to a static hybrid+contextual+rerank pipeline and promote only the query classes my evals show it failing; agentic everywhere is a cost regression dressed as a capability.

cache_read_input_tokens > 0 on the second chunk.

Flashcard. Advanced retrieval is a menu, not a ladder. Hybrid + Contextual Retrieval + rerank is the cheap, high-impact default (Anthropic: ~49% / ~67% fewer failed retrievals); query transforms buy recall on hard questions at the cost of pre-retrieval LLM hops; GraphRAG is the only thing that answers global/multi-hop questions but costs 5–20× at ingest; agentic RAG handles open-ended queries but must be bounded. Always measure retrieval recall on a golden set before adding complexity, and remember prompt caching is what makes contextualization affordable.
1/(k+rank), k≈60 fusion that needs no score calibration. https://plg.uwaterloo.ca/~gvcormack/cormacksigir09-rrf.pdf

Next: /rag/evaluation, which covers how to build the golden set, separate retrieval from generation metrics, and regression-gate all of this in CI so your "advanced" pipeline doesn't quietly regress on the next embedding-model swap.