RAG & Retrieval
IC5 · IC6

Advanced Retrieval

The techniques that turn a demo RAG into a production one — contextual retrieval, query transforms, GraphRAG, and agentic retrieval — and the cost/latency/quality math that tells you which to reach for.

16 min read · 13 sections
Prerequisites: rag/foundations, hybrid-search-and-rerank, chunking-and-embeddings

1. Quick anchor

Naive RAG is chunk → embed → cosine → stuff. Advanced retrieval is everything you bolt on once you measure recall on real queries and discover the embedding model is silently dropping a third of the relevant chunks. The mental model: retrieval is the bottleneck, not generation — a frontier model can only reason over what you put in its context, so the highest-leverage engineering is upstream of the LLM. This lesson covers the four moves that consistently move the needle in 2026: Contextual Retrieval (use an LLM to situate each chunk before indexing), query transforms (rewrite/expand/decompose the question), GraphRAG (build a knowledge graph for global and multi-hop questions a flat index can't answer), and agentic RAG (let the model decide whether, what, and when to retrieve). The staff-level skill is not knowing these exist — it's knowing the cost/latency/quality tradeoff of each well enough to not deploy GraphRAG when hybrid + rerank would have done.

2. Why interviewers probe this

This is the topic that separates "I followed a LangChain tutorial" from "I have run a retrieval system that real users depend on." The interviewer wants to see whether you reach for complexity reflexively or diagnostically.

At IC4, the tells are:

  • Can you explain why dense embeddings miss exact tokens (IDs, error codes, names) and why BM25 misses paraphrase — and therefore why hybrid wins?
  • Do you know Contextual Retrieval as a named technique and can you state the prepend-context trick and the prompt-caching reason it's cheap?
  • Can you separate retrieval metrics (recall@k, nDCG, MRR) from generation metrics (faithfulness, answer relevancy) and say which one you'd debug first?

At IC5, additionally:

  • You propose a measurement before a technique. "I'd build a golden set of ~200 labeled query→chunk pairs and check recall@20 before touching the index."
  • You can sketch when GraphRAG earns its 5–20× build cost versus when it's a liability.
  • You reason about ingestion-time vs. query-time cost shifting (Contextual Retrieval and GraphRAG are expensive at build, cheap at query; agentic RAG is the reverse).

At IC6, additionally:

  • You treat retrieval as a system with regression gates in CI, access-control propagation, freshness SLAs, and a cost model per query.
  • You're opinionated about agentic RAG: where the self-correction loop pays for itself and where it's an unbounded latency/cost risk that needs a hard step budget.
  • You can articulate the composition story — RAG vs. fine-tune vs. long-context — and why they aren't mutually exclusive.

3. Concept build-up

Beginner explainer (new here?)

The words first.

  • RAG — "retrieval-augmented generation": before answering, the system looks up relevant text and pastes it into the prompt so the model has facts to work from.
  • Chunk — one small slice of a document (a paragraph or two) that gets stored and searched as a unit.
  • Index — the searchable store of all chunks. "Indexing" means processing chunks now so you can find them fast later.
  • Embedding — a chunk turned into a list of numbers that captures its meaning, so chunks with similar meaning sit near each other and a query can find them.
  • LLM — a large language model, the AI that reads text and writes text (it can also summarize or label chunks).
  • Entity — a specific thing a text mentions: a person, company, place, or product.
  • Multi-hop question — a question you can only answer by connecting several separate facts (hop A to B to C).

Step by step.

  1. Plain RAG splits a document into chunks and indexes each one on its own.
  2. Problem: a lone chunk like "revenue grew 30%" loses who and when, so search misses it.
  3. Contextual retrieval fixes this: an LLM reads the whole document and writes one short context sentence ("This is from Acme's 2024 Q3 report...") that is prepended to each chunk before indexing.
  4. Now each chunk carries its own context, so queries match it far more reliably.
  5. GraphRAG tackles big-picture questions instead: an LLM scans the docs and extracts entities and how they relate, building a graph (nodes = things, edges = relationships).
  6. To answer a multi-hop or "summarize the whole corpus" question, the system walks that graph and gathers connected facts no single chunk holds.

Remember this: contextual retrieval makes each chunk findable by adding its missing context, while GraphRAG links facts into a graph so you can answer broad, multi-step questions.

3.1 First principles: why retrieval fails

A bi-encoder maps query and document independently into a vector space and scores by cosine/dot product. That single vector is a lossy summary. Two failure modes follow directly:

  1. Lexical blindness. "What's the fix for error ORA-01555?" — the string ORA-01555 carries almost all the information, but a 1024-dim float vector smears it into a neighborhood of "database error" chunks. Sparse retrieval (BM25) nails exact tokens, rare terms, codes, and IDs because it scores on term frequency, not semantics.
  2. Context loss at chunk boundaries. You split a 40-page contract into 800-token chunks. A chunk reads "The cap shall not exceed 18 months of fees." Whose cap? Which agreement? The embedding has no idea, because the disambiguating context lived three pages up. The chunk is individually unretrievable for "Acme MSA liability cap."

Hybrid search (covered in depth in hybrid-search-and-rerank) fixes (1). Contextual Retrieval fixes (2). They compose.
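To make the lexical-blindness point concrete, here is a minimal sketch of a textbook Okapi BM25 scorer in pure Python. The three toy chunks, the whitespace tokenizer, and the `k1`/`b` defaults are illustrative assumptions, not values from any production system; the point is only that the rare exact token dominates the score.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each doc against the query with a textbook Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many docs each term appears at least once.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a high IDF, common terms a low one.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (hypothetical): two generic "error" chunks and one with the exact code.
docs = [
    "generic database error troubleshooting guide",
    "ora-01555 snapshot too old: increase undo retention",
    "database connection error after network timeout",
]
scores = bm25_scores("fix for error ora-01555", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The chunk containing the literal code wins even though it shares no other word with the query, which is exactly the signal a single dense vector tends to blur.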

3.2 Contextual Retrieval (Anthropic, Sept 2024)

The insight is almost embarrassingly simple: before you embed or BM25-index a chunk, prepend a short, chunk-specific blurb that situates it in its parent document — generated by an LLM that has read the whole document. The chunk above becomes:

This chunk is from the Limitation of Liability section (§9.2) of the Acme–Globex Master Services Agreement dated 2024. The cap shall not exceed 18 months of fees.

Now "Acme MSA liability cap" retrieves it on both the dense and sparse paths. Anthropic's published results: contextual embeddings + contextual BM25 reduced failed retrievals (top-20) by ~49%, and ~67% when combined with reranking. Those are among the most reliable public numbers in RAG — attribute them in an interview, don't round them up.

The architecture has three independent levers, and you want all three:

  • Contextual embeddings — embed context + chunk.
  • Contextual BM25 — index context + chunk in the sparse index too.
  • Reranking — pull a wide candidate set (top-50/100), fuse the two lists with Reciprocal Rank Fusion (RRF: score = Σ 1/(k + rank), k≈60, from Cormack 2009 — robust because it never touches the raw, uncalibrated scores), then run a cross-encoder reranker (Cohere Rerank 3, bge-reranker-v2, mxbai-rerank) to cut to top-5.

Reciprocal Rank Fusion (RRF) on real numbers

RRF combines two ranked lists (dense and sparse) without touching their raw scores, which sit on different scales and can't be added directly. Each list contributes 1 / (k + rank) for a document at a given rank, and the contributions are summed across lists. Ranks here are 0-indexed (rank 0 is first place); the original paper counts from 1, so 0-indexed ranks with k = 60 behave like the paper's formula with k = 59. The constant k (typically 60, from Cormack's paper) damps the advantage of the very top ranks. For example, with k = 60, a document at rank 2 in the dense list and rank 5 in the sparse list gets 1 / (60 + 2) ≈ 0.0161 from dense plus 1 / (60 + 5) ≈ 0.0154 from sparse, totaling ≈ 0.0315. A document ranked first (rank 0) in both lists scores 1 / 60 ≈ 0.0167 from each, for ≈ 0.0333. Documents that appear high in both lists bubble to the top of the fused result.

Why it works: you never compare raw cosine or BM25 scores directly (they have different ranges and meanings), only ordinal position, so calibration doesn't matter. A first-place hit with a confident 0.999 and a first-place hit with a weak 0.501 contribute exactly the same amount; giving up that magnitude information is the price of robustness.
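The arithmetic is easy to check by hand. The sketch below uses a hypothetical helper, `rrf_score`, with 0-indexed ranks and k = 60 as in the text; it is a worked example, not a full fusion routine.

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """Sum 1 / (k + rank) over every list the document appears in (0-indexed ranks)."""
    return sum(1.0 / (k + r) for r in ranks)


mixed = rrf_score([2, 5])   # rank 2 in dense, rank 5 in sparse -> about 0.0315
top = rrf_score([0, 0])     # first place in both lists         -> about 0.0333
single = rrf_score([0])     # first place in one list, absent from the other -> about 0.0167

print(round(mixed, 4), round(top, 4), round(single, 4))
```

Note that `mixed` beats `single`: a document that both retrievers rank reasonably well outscores one that only a single retriever ranks first.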

Why the per-chunk LLM call is affordable: prompt caching. Naively, generating context for each of a document's N chunks means feeding the whole document to the LLM N times — quadratic in document size, and ruinous. Prompt caching collapses this. You cache the document once (paying a ~1.25× write premium for the 5-minute TTL), then each per-chunk call re-reads the cached document at ~0.1× the input price and only pays full price for the tiny chunk + instruction + ~50-token output. Anthropic reported a one-time cost of roughly $1.02 per million document tokens to generate contextual chunks this way. The mechanism matters: prompt caching is a prefix match, so the document must sit at the front of the prompt (the stable prefix) and the per-chunk instruction at the end (the volatile suffix). Put a timestamp or chunk index ahead of the document and you invalidate the cache on every call — your bill 10×'s silently and cache_read_input_tokens reads zero.
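A back-of-the-envelope sketch of that saving, counting only the document-prefix tokens. The 1.25× write and 0.1× read multipliers come from the paragraph above; the document size, chunk count, and `price_per_mtok` are placeholder assumptions (not a quoted price), and the model assumes every call lands inside the cache TTL.

```python
def doc_prefix_cost(
    doc_tokens: int,
    n_chunks: int,
    price_per_mtok: float,      # placeholder base input price per million tokens
    write_mult: float = 1.25,   # cache-write premium (5-minute TTL)
    read_mult: float = 0.10,    # cache-read discount
) -> tuple[float, float]:
    """Return (naive_cost, cached_cost) for re-reading one document n_chunks times.

    Ignores the small per-call chunk, instruction, and output tokens.
    """
    per_token = price_per_mtok / 1_000_000
    # Naive: the whole document is billed at full price on every call.
    naive = n_chunks * doc_tokens * per_token
    # Cached: one write at a premium, then (n_chunks - 1) discounted reads.
    cached = (write_mult + (n_chunks - 1) * read_mult) * doc_tokens * per_token
    return naive, cached


naive, cached = doc_prefix_cost(doc_tokens=8_000, n_chunks=10, price_per_mtok=1.00)
print(f"naive={naive:.4f} cached={cached:.4f} ratio={naive / cached:.2f}x")
```

The ratio grows with the chunk count: the more chunks per document, the closer the per-call cost gets to the 0.1× read price.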

3.3 Query transforms

The query the user typed is rarely the optimal query for the index. Transforms reshape it:

  • HyDE (Hypothetical Document Embeddings, Gao 2022). Ask an LLM to hallucinate an answer to the query, then embed the hypothetical answer and search with that vector. Answers live closer to documents in embedding space than questions do. Cheap, surprisingly effective on zero-shot domains; the failure mode is that a confidently wrong hypothetical drags retrieval off-topic.
  • Query rewriting. Strip conversational cruft, resolve coreferences ("it" → "the 2024 contract"), and normalize. Essential for multi-turn chat where the literal last message ("what about the cap?") is meaningless standalone.
  • Decomposition. Break a compound question ("Compare the liability caps in the Acme and Globex MSAs") into sub-queries, retrieve for each, then synthesize. This is the bridge to agentic RAG.
  • Multi-query / RAG-Fusion. Generate several paraphrases of the query, retrieve for each, RRF-fuse the result sets. Trades latency and embedding calls for recall — worth it when a single phrasing under-recalls.
  • Step-back prompting. Ask a more general version first ("What governs liability in MSAs generally?") to pull in foundational context, then the specific question.

The honest tradeoff: every transform adds at least one LLM round-trip before retrieval. On a latency budget of 300 ms that's often unacceptable; on a research assistant tolerating 10 s it's free money. Measure recall lift per added millisecond.
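As one illustration, multi-query retrieval with RRF fusion can be sketched as below. The `paraphrase` and `search` callables are hypothetical stand-ins for an LLM call and an index lookup; they are injected so the sketch runs without any external service, and the chunk IDs are made up.

```python
from typing import Callable


def fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over several ranked lists of chunk IDs (0-indexed ranks)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Break score ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def multi_query_retrieve(
    query: str,
    paraphrase: Callable[[str], list[str]],  # assumed: LLM-backed query rewriter
    search: Callable[[str], list[str]],      # assumed: returns ranked chunk IDs
    top_k: int = 5,
) -> list[str]:
    """Retrieve for the original query plus its paraphrases, then fuse the lists."""
    variants = [query] + paraphrase(query)
    return fuse([search(v) for v in variants])[:top_k]


# Stubbed index and rewriter, for demonstration only.
fake_index = {
    "what about the cap?": ["c7", "c2"],
    "liability cap in the Acme MSA": ["c2", "c9"],
    "maximum liability under the Acme agreement": ["c2", "c7"],
}
fake_paraphrase = lambda q: ["liability cap in the Acme MSA",
                             "maximum liability under the Acme agreement"]

result = multi_query_retrieve("what about the cap?", fake_paraphrase, fake_index.__getitem__)
print(result)
```

Each paraphrase costs one extra search (and the rewriter costs one LLM round-trip), which is the latency this technique trades for recall.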

3.4 GraphRAG (Microsoft, 2024)

Flat vector search answers local questions: "What is X?" It cannot answer global, sense-making questions: "What are the major themes across these 10,000 incident reports?" No single chunk contains the answer; the answer is a property of the whole corpus.

GraphRAG addresses this by, at ingestion time, using an LLM to extract entities and relationships from every chunk into a knowledge graph, running community detection (the Leiden algorithm) to cluster the graph hierarchically, and then generating community summaries at each level. A global query is answered by map-reducing over relevant community summaries rather than over raw chunks. This genuinely unlocks multi-hop ("which engineers touched services that depend on the auth module?") and thematic queries that flat RAG simply gets wrong.

The cost is real and it's where IC5/IC6 candidates earn their stripes: graph construction is an LLM call per chunk (entity/relation extraction) plus summary generation per community — easily 5–20× the ingestion cost of a vector index, and it must be re-run or incrementally patched as the corpus changes. Reach for GraphRAG only when your eval set contains genuinely global/multi-hop questions that hybrid + rerank measurably fails. For the common case — "find the passage that answers this question" — it's expensive overkill. Lighter variants exist: LightRAG (dual-level retrieval, cheaper incremental updates) and agentic graph traversal, where the agent walks the graph on demand instead of pre-summarizing every community.
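The multi-hop part of this is easier to see on a toy graph. The sketch below covers only the traversal step, over hand-written triples; in a real GraphRAG pipeline the triples would come from LLM extraction, and community detection and summaries are omitted entirely. All entity names are hypothetical.

```python
from collections import defaultdict

# (subject, relation, object) triples, as an extraction step might emit them.
triples = [
    ("alice", "touched", "billing-svc"),
    ("bob", "touched", "search-svc"),
    ("carol", "touched", "profile-svc"),
    ("billing-svc", "depends_on", "auth"),
    ("profile-svc", "depends_on", "auth"),
    ("search-svc", "depends_on", "indexer"),
]

# Reverse index: (relation, object) -> set of subjects pointing at it.
incoming: dict[tuple[str, str], set[str]] = defaultdict(set)
for subj, rel, obj in triples:
    incoming[(rel, obj)].add(subj)


def engineers_touching_dependents(module: str) -> list[str]:
    """Two hops: module <- depends_on <- service <- touched <- engineer."""
    services = incoming[("depends_on", module)]
    return sorted(e for s in services for e in incoming[("touched", s)])


print(engineers_touching_dependents("auth"))
```

No single triple (or chunk) contains the answer; it only appears once the two relations are joined, which is what a flat top-k lookup has no mechanism to do.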

3.5 Agentic RAG

The pipelines above are static: a fixed sequence runs on every query. Agentic RAG hands control to the model. Retrieval becomes a tool the model calls (tying directly to /agents and /agents/tool-use-and-mcp — your search index is often exposed as an MCP server). The model now decides:

  • Whether to retrieve at all. "What's 2+2?" needs no retrieval; a static pipeline wastefully retrieves anyway.
  • What to retrieve — it reformulates the query, picks the right source (which is routing: legal corpus vs. code corpus vs. web).
  • When to stop — it retrieves, inspects, judges whether the context is sufficient, and retrieves again with a refined query if not. This is the self-correction loop.
  • Groundedness self-check — before answering, it can verify each claim is supported by retrieved context.

This is strictly more capable and strictly more dangerous. The capability: it handles multi-hop, ambiguous, and "I don't know what I'm looking for" queries that no static pipeline can. The danger: unbounded loops. An agent that keeps deciding "not enough context, retrieve again" can burn 15 LLM calls and 30 seconds on one query. In production you always impose a hard step budget (e.g. ≤4 retrieval rounds), a token budget, and a fallback to "answer with what you have + flag low confidence." With Claude, adaptive thinking lets the model decide reasoning depth per step, and a task_budget can give the loop a self-moderating token ceiling — but the harness-level step cap is non-negotiable.
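A harness-level step cap can be as small as the loop below. This is a sketch under assumptions: `search`, `is_sufficient`, and `refine` are hypothetical callables standing in for the retrieval tool and two model judgments, and token budgeting is left out.

```python
from typing import Callable


def bounded_agentic_retrieve(
    question: str,
    search: Callable[[str], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    refine: Callable[[str, list[str]], str],
    max_rounds: int = 4,
) -> dict:
    """Retrieve, judge, and refine until the context suffices or the cap is hit."""
    context: list[str] = []
    query = question
    for round_no in range(1, max_rounds + 1):
        for chunk in search(query):
            if chunk not in context:  # keep each chunk once
                context.append(chunk)
        if is_sufficient(question, context):
            return {"context": context, "rounds": round_no, "low_confidence": False}
        query = refine(question, context)
    # Budget exhausted: answer with what we have and flag it, rather than loop on.
    return {"context": context, "rounds": max_rounds, "low_confidence": True}


# Demo with stubs: a judge that is never satisfied must still stop after 4 rounds.
calls: list[str] = []

def stub_search(q: str) -> list[str]:
    calls.append(q)
    return [f"chunk-{len(calls)}"]

never_enough = bounded_agentic_retrieve(
    "open-ended question", stub_search,
    is_sufficient=lambda q, ctx: False,
    refine=lambda q, ctx: q + " (refined)",
)
print(never_enough["rounds"], never_enough["low_confidence"], len(calls))
```

The cap lives in the harness, not in the prompt, so it holds even when the model's own judgment keeps asking for more context.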

The pragmatic 2026 stance: start static (hybrid + contextual + rerank), and promote to agentic only for the query classes your evals show static retrieval failing. Agentic RAG is the right default for open-ended research assistants and the wrong default for a latency-sensitive support widget.

4. Minimal implementation

The single highest-leverage technique here is Contextual Retrieval, and the load-bearing detail is the prompt-caching layout. Here's a real, runnable implementation using claude-haiku-4-5 (cheapest model, ample for this summarization task) to generate per-chunk context, with the parent document cached.

```python
# pip install anthropic rank_bm25 sentence-transformers
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

# The document goes FIRST (stable cached prefix); the chunk instruction goes
# LAST (volatile suffix). Reversing these two breaks the prefix cache.
DOC_BLOCK = "<document>\n{doc}\n</document>"
CHUNK_INSTRUCTION = (
    "Here is a chunk from the document above:\n"
    "<chunk>\n{chunk}\n</chunk>\n\n"
    "Give a short, succinct context (1-2 sentences) situating this chunk within "
    "the overall document, to improve search retrieval of the chunk. "
    "Answer ONLY with the context, nothing else."
)


def situate(doc: str, chunk: str):
    """Generate a retrieval-context blurb for one chunk.

    The full document is cached once (cache_control on the first block); every
    subsequent chunk in the same document re-reads it at ~0.1x input price.
    Returns (context, usage) so the caller can inspect cache metrics.
    """
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": DOC_BLOCK.format(doc=doc),
                    "cache_control": {"type": "ephemeral"},  # <-- cache the doc
                },
                {
                    "type": "text",
                    "text": CHUNK_INSTRUCTION.format(chunk=chunk),
                },
            ],
        }],
    )
    # Take the first text block; skip any non-text content blocks.
    context = next(b.text for b in resp.content if b.type == "text")
    return context, resp.usage


def contextualize_document(doc: str, chunks: list[str]) -> list[str]:
    enriched = []
    for i, chunk in enumerate(chunks):
        context, usage = situate(doc, chunk)
        # Sanity-check caching on the 2nd call: cache_read should dominate.
        # (Documents below the minimum cacheable prefix will not cache.)
        if i == 1:
            assert usage.cache_read_input_tokens > 0, "doc not cached; check prefix layout"
        enriched.append(f"{context}\n\n{chunk}")  # prepend, then index THIS string
    return enriched
```

What this code does, section by section

The situate function takes the full document and one chunk, sends them to Claude, and gets back a context blurb. Breaking it down:

  1. Message assembly: you build a list with two text blocks. First comes the document (wrapped in <document> tags and marked with cache_control: {"type": "ephemeral"}), second the chunk instruction. The document goes first so it becomes the stable cached prefix; the instruction goes second so each call varies only that tiny part.
  2. API call: send to Claude Haiku (cheap, fast enough for summarization) with a small output budget (150 tokens for the context).
  3. Extraction: pull the text from the response, skipping any non-text content blocks.
  4. Return: give back both the context string and the usage metrics so the caller can verify that caching worked (cache_read_input_tokens > 0 from the second call onward).

The whole flow is wrapped in a loop in contextualize_document, which calls situate once per chunk and builds an enriched list of "context + chunk" strings ready for embedding and indexing.

 
You then index the enriched strings on both paths (dense and sparse) and fuse at query time:
 
```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np
 
embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")  # or OpenAI / Cohere / Voyage
 
enriched = contextualize_document(doc, chunks)
dense = embedder.encode(enriched, normalize_embeddings=True)      # contextual embeddings
bm25 = BM25Okapi([c.split() for c in enriched])                  # contextual BM25
 
def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
 
def retrieve(query: str, top_k: int = 50) -> list[int]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    dense_rank = np.argsort(-(dense @ q))[:top_k].tolist()
    sparse_rank = np.argsort(-bm25.get_scores(query.split()))[:top_k].tolist()
    fused = rrf([dense_rank, sparse_rank])
    return fused[:top_k]  # -> feed top-50 to a cross-encoder reranker, cut to top-5
```

Honest notes on this code: (1) For real ingestion of thousands of documents, run the situate calls through the Batches API at 50% off — context generation is not latency-sensitive. (2) The assert on cache_read_input_tokens is your single best defense against a silently-broken cache; keep something like it in your ingestion telemetry. (3) The reranker (e.g. FlagEmbedding's bge-reranker-v2 or cohere.rerank) is omitted for brevity but is where the gain jumps from ~49% to ~67% — don't skip it. (4) Caching has a minimum prefix (~4096 tokens on Haiku 4.5); a tiny document won't cache and the per-chunk cost stays higher — that's fine, the absolute cost is still small.
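For completeness, the omitted rerank step might look like the sketch below. The scorer is injected so the example runs without downloading a model; `overlap_scorer` is a deliberately crude stand-in, and the suggestion in the comment (passing `CrossEncoder(...).predict` from sentence-transformers) is an assumption about how you would wire in a real cross-encoder, not something the code above does.

```python
from typing import Callable, Sequence

Pair = tuple[str, str]


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[Pair]], Sequence[float]],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, candidate) pair jointly and keep the top_n candidates."""
    pairs = [(query, text) for text in candidates]
    scores = score_pairs(pairs)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]


def overlap_scorer(pairs: list[Pair]) -> list[float]:
    """Stand-in scorer: shared lowercase tokens. A real pipeline would swap in a
    cross-encoder here, e.g. (assumption) CrossEncoder("BAAI/bge-reranker-v2-m3").predict
    """
    return [float(len(set(q.lower().split()) & set(t.lower().split()))) for q, t in pairs]


candidates = [
    "refund policy for annual plans",
    "acme msa liability cap is 18 months of fees",
    "office parking rules",
]
top = rerank("acme msa liability cap", candidates, overlap_scorer, top_n=2)
print(top)
```

Keeping the scorer behind a callable also makes it easy to measure the reranker's contribution in isolation: run the golden set once with the fused order and once with the reranked order.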

5. Production tradeoffs

| Technique | Build/ingest cost | Query latency | Query cost | Quality lift | Main failure mode |
|---|---|---|---|---|---|
| Naive dense | Low | Low | Low | Baseline | Lexical blindness; context loss |
| Hybrid + rerank | Low | +rerank (~50–200 ms) | +rerank | Large on lexical/recall | Reranker latency at high QPS |
| Contextual Retrieval | Medium (LLM per chunk, ~$1/M tokens cached) | Same as hybrid | Same as hybrid | ~49% fewer failed retrievals; ~67% w/ rerank | Cache invalidation; stale context on doc edits |
| Query transforms | None | +1 LLM round-trip each | +LLM call(s) | High on hard/ambiguous queries | Latency; a bad HyDE drags retrieval |
| GraphRAG | High (5–20× ingest) | Medium–high (map-reduce) | High | Decisive on global/multi-hop | Cost; graph staleness; overkill on local Qs |
| Agentic RAG | Low | High, variable | High, variable | Decisive on open-ended | Unbounded loops; runaway cost/latency |

The prose that matters around the table:

What shifts at scale. Contextual Retrieval and GraphRAG move cost to ingestion (paid once, amortized over every query) and keep query-time cheap — ideal for high-QPS, stable corpora. Agentic RAG and query transforms move cost to query time — fine for low-QPS, high-value queries; lethal at scale. Pick where you want the cost to live.

Failure modes you must name. (1) Cache invalidation — Contextual Retrieval's cheap-ness assumes the document prefix is byte-stable; document edits force regeneration. (2) Staleness — graphs and community summaries lag the corpus; you need an incremental update story or an SLA on rebuild cadence. (3) Reranker as bottleneck — cross-encoders are O(candidates) forward passes; at 1000 QPS reranking top-100 is your dominant cost. ColBERT-style late interaction (per-token embeddings + MaxSim, PLAID index, via ragatouille) gets near-cross-encoder quality at far lower query latency — the staff move when rerank latency caps your throughput. (4) Agentic runaway — always bound steps and tokens.
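The MaxSim operator behind late interaction is small enough to show directly. This is a toy sketch with hand-written 2-dimensional "token embeddings"; it illustrates the scoring rule only and is not the ragatouille or PLAID API.

```python
Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: list[Vector], doc_vecs: list[Vector]) -> float:
    """For each query token, take its best match among the doc tokens; sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy vectors (hypothetical): the query has two tokens pointing in different directions.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]  # covers both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]              # covers only the first

score_a = maxsim(query_vecs, doc_a)  # 0.9 + 0.9
score_b = maxsim(query_vecs, doc_b)  # 0.9 + 0.2
print(round(score_a, 2), round(score_b, 2))
```

Because document token vectors are computed at index time, query-time work is dot products and maxima rather than a transformer forward pass per candidate, which is where the latency advantage over a cross-encoder comes from.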

Evaluate retrieval and generation separately. Build a golden set from real queries with labeled relevant chunks. Measure recall@k / nDCG / MRR on retrieval in isolation — if recall@20 is 0.6, no prompt engineering on the generator will save you, because 40% of the time the answer isn't even in context. Then measure faithfulness and answer relevancy (RAGAS, DeepEval, TruLens — LLM-as-judge) on generation. Gate both in CI so an embedding-model swap can't silently regress recall. (Full treatment in /rag/evaluation.)
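The retrieval-side metrics are a few lines each. The sketch below assumes binary relevance labels and a single query; a golden-set run would average these over all labeled queries. The chunk IDs are made up.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant chunk (positions start at 1), else 0."""
    for pos, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / pos
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 1) for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


retrieved = ["c4", "c1", "c9", "c2"]   # system output, best first
relevant = {"c1", "c2", "c7"}          # labeled ground truth for this query

r3 = recall_at_k(retrieved, relevant, k=3)   # only c1 is in the top 3 -> 1/3
rr = reciprocal_rank(retrieved, relevant)    # first hit at position 2 -> 0.5
ndcg4 = ndcg_at_k(retrieved, relevant, k=4)
print(round(r3, 3), rr, round(ndcg4, 3))
```

Averaging `reciprocal_rank` over the golden set gives MRR; a CI gate can then fail the build when any of these drops below a stored baseline.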

6. How it's asked

[IC4] "Walk me through Contextual Retrieval and why it doesn't cost a fortune." You prepend an LLM-generated, document-aware blurb to each chunk before embedding and BM25-indexing it, so chunks that were individually ambiguous become retrievable. It's affordable because of prompt caching: you cache the parent document once and each per-chunk call re-reads it at ~0.1× input price instead of re-sending the whole document — Anthropic reports ~$1.02 per million document tokens. Anthropic measured ~49% fewer failed retrievals with contextual embeddings + contextual BM25, ~67% with reranking added.
[IC4] "Dense retrieval is missing exact product SKUs in my e-commerce search. Fix it without retraining." Add a BM25 sparse path and fuse with RRF — SKUs are exact tokens that term-frequency matching catches and dense vectors smear. That alone likely recovers most of the loss; if recall is still short, contextualize the chunks so a SKU's surrounding product context is in the indexed text, and add a reranker on the fused top-50. No model retraining required, and I'd verify the lift on a labeled SKU-query set before shipping.
[IC5] "50k internal docs; users ask both 'what does this policy say' and 'what are the recurring themes in last quarter's incidents.' Architect it." Two query classes, two strategies. For the local "what does X say" questions, hybrid + Contextual Retrieval + rerank is the workhorse and I'd ship it first because it's cheap at query time. The thematic/global questions are exactly what flat vector search can't do — those justify a GraphRAG layer (entity/relation extraction → Leiden communities → community summaries). I'd route between them with a lightweight classifier or let an agent pick the source, and I'd only build the graph after confirming on a golden set that hybrid genuinely fails the global questions, because graph construction is 5–20× the ingest cost and needs an incremental-update plan.
[IC6] "When does agentic RAG earn its keep, and how do you keep it from melting your budget?" It earns its keep on open-ended, multi-hop, or ambiguous queries where a single retrieval pass structurally can't gather the right context — the agent reformulates, routes among sources, and self-corrects until the context is sufficient. The cost is variable latency and the risk of unbounded retrieve-judge-retrieve loops, so in production I impose a hard step budget (≤3–4 rounds), a token/task_budget ceiling, and a graceful fallback that answers with current context and flags low confidence rather than looping forever. I'd default to a static hybrid+contextual+rerank pipeline and promote only the query classes my evals show it failing — agentic everywhere is a cost regression dressed as a capability.
[IC6] "RAG, fine-tuning, or just a 1M-token context window?" They solve different problems and compose. RAG is for fresh, large, attributable, access-controlled knowledge — you can cite sources and enforce per-user document permissions, which a fine-tune can't. Fine-tuning (/finetuning) is for behavior, format, and domain skill, not facts. Long context (/context-engineering) is for whole-document reasoning where you genuinely need everything in view, but it pays in cost, latency, and "lost in the middle" degradation, so it's not a retrieval replacement at corpus scale. A mature system often does all three: retrieve the right slice, stuff a generous context, and run a model fine-tuned for the domain's output format.

7. Pitfalls & flashcards

  • Reaching for GraphRAG/agentic before measuring. If you haven't built a golden set and checked recall@k, you're guessing. The cheapest win is almost always hybrid + rerank + contextualization.
  • Breaking the prompt cache in Contextual Retrieval. Putting anything volatile (timestamp, chunk index, UUID) ahead of the cached document invalidates the prefix every call. Verify cache_read_input_tokens > 0 on the second chunk.
  • Forgetting to contextualize the BM25 index too. Contextual embeddings alone leave the sparse path blind to the added context; index the enriched string on both paths.
  • Generating context per chunk synchronously at query time. It's an ingestion step. Do it offline, ideally via the Batches API.
  • Letting an agentic loop run unbounded. No step budget = a single query that costs $5 and 40 seconds. Always cap rounds and tokens with a fallback.
  • Evaluating end-to-end only. A great answer can mask terrible retrieval (the model knew it anyway); a bad answer can mask great retrieval (the generator fumbled). Measure the two stages separately.
  • Stale graphs/contexts. Any LLM-derived index layer (community summaries, chunk contexts) drifts from the source on edits — own the rebuild/incremental-update SLA.

Flashcard. Advanced retrieval is a menu, not a ladder. Hybrid + Contextual Retrieval + rerank is the cheap, high-impact default (Anthropic: ~49% / ~67% fewer failed retrievals); query transforms buy recall on hard questions at the cost of pre-retrieval LLM hops; GraphRAG is the only thing that answers global/multi-hop questions but costs 5–20× at ingest; agentic RAG handles open-ended queries but must be bounded. Always measure retrieval recall on a golden set before adding complexity — and remember prompt caching is what makes contextualization affordable.

8. Further reading

Next: /rag/evaluation — how to build the golden set, separate retrieval from generation metrics, and regression-gate all of this in CI so your "advanced" pipeline doesn't quietly regress on the next embedding-model swap.
