The ingest-side decisions—how you cut documents and how you turn text into vectors—silently cap your retrieval ceiling, which is why interviewers probe them before they ask about rerankers.
Chunking and embedding are the two ingest-time decisions that set the ceiling on everything downstream. Retrieval can only ever return spans you indexed, in the granularity you indexed them, encoded by a model that may or may not have understood them. A chunk is the atomic unit a retriever can return: too big and the embedding averages away the signal (and you waste context tokens on irrelevant text); too small and the chunk loses the context that made it meaningful. An embedding is a lossy projection of text into a vector where geometric closeness is supposed to mean semantic relevance—but a bi-encoder compresses a whole passage into one vector before it has seen the query, so anything it discarded is gone. Get these two right and a reranker has good candidates to sharpen; get them wrong and no reranker, query rewrite, or 1M-token window can recover what was never retrievable. This is why retrieval, not generation, is almost always the bottleneck.
This topic separates people who have tuned a RAG system from people who have only wired one together. The "default" pipeline (RecursiveCharacterTextSplitter(chunk_size=1000) + text-embedding-3-small + cosine) works in a demo and quietly fails in production, and interviewers want to see whether you know why and where.
At IC4 the tells are:
At IC5/IC6 the tells are:
The words first.
A retriever returns units you chose at ingest. If your unit is a 4,000-token section, the smallest thing you can hand the generator is that whole section, even when the answer is one sentence—so you pay context tokens for noise and risk "lost in the middle," where models attend less reliably to material buried in the center of a long context. If your unit is a single sentence, you can pin the exact answer but you've stripped the surrounding context that disambiguates it ("It was discontinued in 2021"—what was?).
So chunking is a bias/variance tradeoff on information density per vector. A bi-encoder mean-pools (or CLS-pools) a span into one fixed vector; the more heterogeneous the span, the more that vector is a blurry average of several topics, and the lower its cosine similarity to any specific query. The practical sweet spot for prose is usually ~200–500 tokens with ~10–20% overlap, but the right answer depends on your query distribution: fact-lookup queries want small chunks, synthesis/"explain" queries want larger or hierarchical ones.
Imagine you have a document discussing both "revenue recognition" and "tax accounting." You embed it as one chunk.
Split into two topic-specific chunks, a revenue-recognition query might match the revenue chunk at cosine_sim = 0.78 vs the tax chunk at 0.15; together as one big chunk, the vector might score only 0.55, losing the signal from both.

Conversely, if you chunk by every sentence, you get precise hits but lose context: "It was discontinued in 2021" is a true sentence but meaningless without knowing what "it" is.
The tradeoff: ~300 tokens with ~20% overlap is a common starting point because a chunk that size is large enough to keep facts that depend on each other in the same vector, yet small enough that the vector is mostly signal rather than noise from unrelated text.
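The dilution effect can be illustrated with a toy calculation. This is a minimal sketch, assuming two made-up 2-D "topic" directions in place of real embeddings and mean pooling as the way a mixed chunk gets its vector; the numbers are illustrative, not measurements.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pool(vectors):
    """Average several vectors component-wise into one."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Toy topic directions (assumed for illustration, not real embeddings).
revenue = [1.0, 0.0]
tax = [0.0, 1.0]
query = [0.9, 0.1]  # a query that is mostly about revenue

# One big chunk covering both topics ends up between the two directions.
mixed = mean_pool([revenue, tax])

sim_revenue = cosine(query, revenue)
sim_tax = cosine(query, tax)
sim_mixed = cosine(query, mixed)
print(f"revenue={sim_revenue:.2f} tax={sim_tax:.2f} mixed={sim_mixed:.2f}")
```

The mixed chunk lands between the two focused chunks: it beats the off-topic chunk but loses to the on-topic one, which is the averaging penalty described above.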
Fixed-size splits every N tokens (or characters) with a fixed overlap. Simple, fast, and deterministic—but it cuts mid-sentence and mid-table, which is exactly where embeddings degrade. Use it only as a baseline.
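As a baseline reference, fixed-size splitting with overlap fits in a few lines. This is a sketch under stated assumptions: whitespace splitting stands in for a real tokenizer, and `fixed_size_chunks` is a hypothetical helper name, not a library function.

```python
def fixed_size_chunks(tokens, size, overlap):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

# Whitespace split is an assumption standing in for a real tokenizer.
tokens = "the refund window is thirty days from the delivery date".split()
chunks = fixed_size_chunks(tokens, size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Note that the windows ignore sentence boundaries entirely, which is why this approach is only useful as a baseline to compare against.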
Recursive (LangChain's RecursiveCharacterTextSplitter) tries a priority list of separators—["\n\n", "\n", ". ", " ", ""]—splitting on the coarsest boundary that keeps chunks under the size limit, then recursing. This respects paragraph/sentence structure far better than fixed-size and is the right default. Watch the unit: by default chunk_size counts characters, not tokens; for token budgeting use .from_huggingface_tokenizer(...) or a tiktoken-based length function.
Structure-aware splits on the document's own skeleton—Markdown headings, HTML tags, code AST nodes, PDF layout. A MarkdownHeaderTextSplitter keeps a section intact and attaches the heading path (# Billing > ## Refunds) as metadata. For code, splitting on function/class boundaries beats splitting on line count. This is the highest-leverage upgrade for structured corpora because it aligns chunk boundaries with semantic boundaries for free.
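To make the idea concrete, here is a minimal hand-rolled sketch of heading-based splitting. It is not LangChain's MarkdownHeaderTextSplitter; `split_by_headings` and the `"Billing > Refunds"` path format are assumptions for illustration, and it does not handle `#` lines inside fenced code.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_by_headings(markdown):
    """Return one chunk per section, tagged with its heading path."""
    chunks = []
    path = []  # stack of (level, title) for the headings currently in scope
    body = []  # lines collected for the current section

    def flush():
        text = "\n".join(body).strip()
        if text:
            heading_path = " > ".join(title for _, title in path)
            chunks.append({"text": text, "heading_path": heading_path})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before the path changes
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave sibling or deeper sections
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks

doc = """# Billing
Intro text.
## Refunds
Refunds take 5 days.
## Invoices
Invoices are monthly."""

sections = split_by_headings(doc)
for section in sections:
    print(section["heading_path"], "->", section["text"])
```

Keeping the heading path as metadata means a chunk like "Refunds take 5 days." still carries the information that it belongs to Billing.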
Semantic chunking places boundaries where the embedding distance between consecutive sentences spikes—i.e., where the topic shifts. It produces topically coherent chunks but costs an embedding pass at ingest and is sensitive to the threshold. In practice it often underperforms a good structure-aware splitter while costing more, so reach for it when documents lack usable structure (long unstructured transcripts).
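The boundary rule can be sketched as follows. This is a toy illustration, assuming a bag-of-words counter (`toy_embed`) as a stand-in for a real sentence-embedding model and a hand-picked threshold; real systems would embed with an actual model and tune the threshold on data.

```python
import math
from collections import Counter

def toy_embed(sentence):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(sentence.lower().replace(".", "").split())

def cosine_distance(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)

def semantic_chunks(sentences, threshold):
    """Start a new chunk wherever consecutive sentences are far apart."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    previous = toy_embed(sentences[0])
    for sentence in sentences[1:]:
        current = toy_embed(sentence)
        if cosine_distance(previous, current) > threshold:
            groups.append([sentence])  # distance spike: topic shift
        else:
            groups[-1].append(sentence)
        previous = current
    return [" ".join(group) for group in groups]

sentences = [
    "Refunds are issued within five days.",
    "Refunds over fifty dollars need approval.",
    "The API rate limit is one hundred requests.",
    "The API returns an error above the rate limit.",
]
chunks = semantic_chunks(sentences, threshold=0.9)
for chunk in chunks:
    print(chunk)
```

With this toy data a slightly lower threshold would also split the two refund sentences apart, which is a small demonstration of the threshold sensitivity mentioned above.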
Parent-child / small-to-big decouples what you embed from what you return: embed small, precise child chunks (good retrieval signal), but return their larger parent (good generation context). You index children with a pointer to the parent and fetch the parent at read time. This is one of the best quality/effort trades in production RAG because it fixes the granularity dilemma directly—but mind context cost: returning huge parents reintroduces "lost in the middle," so cap parent size.
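A minimal sketch of the child-to-parent lookup, assuming word overlap as a stand-in for vector similarity and sentence-level children; the `parents` data and the `retrieve_parents` helper are hypothetical.

```python
def tokenize(text):
    return set(text.lower().replace(".", "").replace("?", "").split())

# Parents are what the generator sees; children are what gets matched.
parents = {
    "p1": "Refund policy. Refunds are issued within five days. Items must be unused.",
    "p2": "Shipping policy. Orders ship in two days. Express shipping costs extra.",
}

# Each child keeps a pointer back to its parent.
children = [
    {"text": sentence.strip(), "parent_id": parent_id}
    for parent_id, text in parents.items()
    for sentence in text.split(".")
    if sentence.strip()
]

def retrieve_parents(query, k=2):
    """Rank children, then return their distinct parents in rank order."""
    query_words = tokenize(query)
    ranked = sorted(
        children,
        key=lambda child: len(query_words & tokenize(child["text"])),
        reverse=True,
    )
    seen, results = set(), []
    for child in ranked[:k]:
        if child["parent_id"] not in seen:  # several children may share a parent
            seen.add(child["parent_id"])
            results.append(parents[child["parent_id"]])
    return results

context = retrieve_parents("How fast are refunds issued?")
print(context)
```

The deduplication step matters: the top children often point to the same parent, and returning it twice would waste context.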
Late chunking (Jina, 2024) inverts the order: run the whole document through a long-context embedding model first, then mean-pool token embeddings into chunk vectors after the transformer has attended across the full document. Each chunk vector is thus conditioned on global context (pronouns and references resolve), with one model pass instead of an LLM call per chunk. It needs a long-context embedding model and is most useful when chunks are heavily cross-referential.
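The pooling step can be sketched on its own. This assumes the token embeddings have already been produced by a long-context model that attended over the whole document; the values and spans below are made up, and `late_chunk` is a hypothetical helper name.

```python
def mean_pool(vectors):
    """Average token vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def late_chunk(token_embeddings, spans):
    """Pool already-contextualized token embeddings into one vector per chunk.

    `spans` holds (start, end) token indices with `end` exclusive.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]

# Hypothetical contextualized token embeddings for a 6-token document.
token_embeddings = [
    [1.0, 0.0], [0.8, 0.2], [0.6, 0.4],
    [0.2, 0.8], [0.0, 1.0], [0.4, 0.6],
]
spans = [(0, 3), (3, 6)]  # chunk boundaries chosen after the model pass

chunk_vectors = late_chunk(token_embeddings, spans)
print(chunk_vectors)
```

The difference from ordinary chunking is only in the order of operations: the transformer runs once over the full document, and chunk boundaries are applied to its token-level output rather than to its input.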
Overlap carries a tail of one chunk into the next so a fact straddling a boundary survives in at least one chunk. It's cheap insurance against boundary loss; the cost is index bloat and near-duplicate hits (dedupe at retrieval). Metadata is non-negotiable: attach source, section path, timestamps, and access-control tags to every chunk so you can pre-filter (tenant, recency, permissions) before or during the vector search. Filtering is often what makes retrieval correct, not just relevant.
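A minimal sketch of metadata pre-filtering, assuming an in-memory list of chunk records; the field names (`tenant`, `year`, `groups`) and the `prefilter` helper are hypothetical, and a real vector database would apply the same conditions inside its own filter syntax.

```python
chunks = [
    {"id": "a", "tenant": "acme", "year": 2024, "groups": {"finance"}},
    {"id": "b", "tenant": "acme", "year": 2019, "groups": {"finance"}},
    {"id": "c", "tenant": "globex", "year": 2024, "groups": {"finance"}},
    {"id": "d", "tenant": "acme", "year": 2024, "groups": {"legal"}},
]

def prefilter(chunks, tenant, user_groups, min_year):
    """Keep only chunks the caller may see and that are recent enough."""
    return [
        chunk
        for chunk in chunks
        if chunk["tenant"] == tenant
        and chunk["year"] >= min_year
        and chunk["groups"] & user_groups  # at least one shared access group
    ]

candidates = prefilter(chunks, tenant="acme", user_groups={"finance"}, min_year=2023)
print([chunk["id"] for chunk in candidates])
```

Only the surviving candidates would then be scored by vector similarity, so a chunk from another tenant can never be returned no matter how similar it is.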
The deepest failure mode of independent chunks is that each is embedded with no knowledge of its document. Anthropic's Contextual Retrieval (Sept 2024) prepends a short, LLM-generated, chunk-specific blurb situating the chunk in its document ("This chunk is from the Q3 2023 ACME 10-Q, discussing revenue recognition...") before embedding and before BM25 indexing. Per their published results, contextual embeddings + contextual BM25 reduced failed retrievals by ~49%, and adding a reranker pushed it to ~67%. The per-chunk LLM cost—you call a cheap model like claude-haiku-4-5 once per chunk—is made affordable by prompt caching the full document across all of its chunks. This is a chunking-and-embedding technique (it changes what text gets vectorized), and it composes with hybrid search and rerank.
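The text transformation itself is simple and can be sketched without any model call. This is not Anthropic's implementation: `contextualize` is a hypothetical helper, and `stub_situate` is a placeholder for the per-chunk LLM call (prompt caching and the actual prompt are not shown).

```python
def contextualize(document, chunks, situate):
    """Prepend a short chunk-specific context to each chunk before indexing.

    `situate(document, chunk)` is expected to return one or two sentences
    that place the chunk within the document.
    """
    contextualized = []
    for chunk in chunks:
        context = situate(document, chunk).strip()
        contextualized.append(f"{context}\n\n{chunk}" if context else chunk)
    return contextualized

def stub_situate(document, chunk):
    # Placeholder only: a real pipeline would ask an LLM for this sentence.
    title = document.splitlines()[0]
    return f"This chunk is from {title}."

document = "ACME Q3 2023 10-Q\nRevenue grew 3% over the previous quarter."
chunks = ["Revenue grew 3% over the previous quarter."]

indexed_texts = contextualize(document, chunks, stub_situate)
print(indexed_texts[0])
```

The same contextualized string would be fed to both the embedding model and the BM25 index, so the added context helps dense and lexical retrieval alike.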
An embedding model for retrieval is a bi-encoder: a transformer that maps a text span to a single dense vector, with query and document encoded independently. You compare with cosine similarity (or dot product, which is identical if vectors are L2-normalized). Independence is the source of both its strength and its weakness. It is what lets you precompute every document vector once and run approximate nearest-neighbor search over millions of them in milliseconds; but because the encoder never sees the query and document together, it must guess at ingest what will matter, and it cannot do the token-level cross-attention that a cross-encoder reranker does. That is the architectural reason rerankers add relevance on top of dense retrieval, commonly quoted as 10–30% though the gain varies by corpus and metric.
Models are trained with contrastive objectives (pull matched query/passage pairs together, push mismatched ones apart), often with hard negatives. Several leading open models—the E5 and BGE families—use asymmetric prefixes: you must prepend query: / passage: (E5) or a search instruction (BGE) or you silently lose quality. Forgetting the prefix is one of the most common, hardest-to-spot bugs in a homegrown pipeline.
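One way to make the prefix hard to forget is to route every encode call through small helpers. This is a minimal sketch for the E5 convention only; the helper names are hypothetical, and the resulting strings would be passed to whatever encoder you use.

```python
E5_QUERY_PREFIX = "query: "
E5_PASSAGE_PREFIX = "passage: "

def e5_query(text):
    """Format a search query the way E5 models expect."""
    return E5_QUERY_PREFIX + text.strip()

def e5_passages(texts):
    """Format documents/chunks the way E5 models expect."""
    return [E5_PASSAGE_PREFIX + text.strip() for text in texts]

query_text = e5_query("How long do refunds take? ")
passage_texts = e5_passages(["Refunds are issued within five days."])
print(query_text)
print(passage_texts[0])
```

Keeping the prefixes in one place also makes them easy to change when the embedding model changes, since other model families use different conventions.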
Treat MTEB (the Massive Text Embedding Benchmark) as a starting filter, not gospel—leaderboard rank is dominated by general web text and is gameable; your corpus (legal, code, biomedical, multilingual) may rank models completely differently. The selection axes that matter in 2026:
- Dimensions: text-embedding-3-large is 3072-d, -small is 1536-d; open models like BGE-M3 are 1024-d, E5/BGE-small are 384-d. More dimensions can mean marginally better recall but linearly more storage and RAM and slower distance math.
- Matryoshka truncation (keeping only the leading dimensions of each vector): text-embedding-3-* and several open models support this. It's the cleanest knob for trading recall against cost.

At small scale (≤1M chunks) you can keep float32 vectors in pgvector or LanceDB and not think hard. At 10M–1B chunks, memory dominates cost and you quantize:
You have 100M chunks at 1024 dimensions each. Storage at float32:
100M * 1024 * 4 bytes = 409 GB.
Storage at int8:
100M * 1024 * 1 byte = 102 GB (4× savings).
Storage at binary (1 bit/dim):
100M * 1024 * 0.125 bytes = 12.8 GB (32× savings).
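The same arithmetic as a small helper, so other corpus sizes and dimensions can be plugged in. This is a sketch that counts raw vector bytes only (using 1 GB = 10^9 bytes, as above); index structures such as HNSW graph links add overhead on top, and `index_size_gb` is a hypothetical name.

```python
def index_size_gb(num_vectors, dims, bits_per_dim):
    """Raw vector storage in GB (10^9 bytes), excluding index overhead."""
    return num_vectors * dims * bits_per_dim / 8 / 1e9

NUM_CHUNKS = 100_000_000
DIMS = 1024

for name, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    print(f"{name}: {index_size_gb(NUM_CHUNKS, DIMS, bits):.1f} GB")

# Truncating to 512 dimensions halves the float32 footprint.
truncated_gb = index_size_gb(NUM_CHUNKS, 512, 32)
print(f"float32 @ 512-d: {truncated_gb:.1f} GB")
```

Dimension truncation and quantization multiply: halving the dimensions and moving from float32 to int8 together give an 8× reduction.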
What you lose: float32 is precise. int8 rounds each dimension to one byte (256 levels), so two chunks that were only slightly different in float32 can become indistinguishable in int8, and recall typically drops slightly (roughly 1–3%). Binary is rougher: Hamming distance on 1-bit vectors is fast but coarse. The standard remedy is rescoring: use the small, fast binary index to find the top-500 candidates, then recompute distances on the original float32 vectors to rank them precisely. You win a 32× smaller first-pass index and a very fast first pass. You pay with extra work on the second pass, and the float32 vectors must still be kept somewhere (for example on disk) for rescoring.
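The two-pass pattern can be sketched in plain Python. This is a toy illustration, assuming sign-based 1-bit quantization and a dot product as the full-precision score; the vectors are made up and `search` is a hypothetical helper, not a vector-database API.

```python
def to_bits(vector):
    """1-bit quantization: keep only the sign of each dimension."""
    return [1 if x > 0 else 0 for x in vector]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query, vectors, shortlist, k):
    """Coarse pass on binary codes, then rescore the shortlist at full precision."""
    query_bits = to_bits(query)
    codes = [to_bits(v) for v in vectors]  # precomputed at ingest in practice
    coarse = sorted(
        range(len(vectors)), key=lambda i: hamming(query_bits, codes[i])
    )[:shortlist]
    rescored = sorted(coarse, key=lambda i: dot(query, vectors[i]), reverse=True)
    return rescored[:k]

query = [0.9, 0.1, -0.4]
vectors = [
    [0.8, 0.2, -0.5],   # close to the query
    [0.1, 0.9, -0.1],   # same sign pattern, different direction
    [-0.7, 0.3, 0.6],
    [0.5, -0.5, 0.7],
]

top = search(query, vectors, shortlist=2, k=1)
print(top)
```

In this toy data the first two vectors have identical binary codes, so the Hamming pass cannot tell them apart; only the full-precision rescoring separates them, which is why skipping the rescoring step costs recall.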
For HNSW, tune ef_search (and M) to move along the recall/latency curve. Vector DBs in play: pgvector, Qdrant, Weaviate, Milvus, LanceDB, Pinecone, Turbopuffer.

The staff-level point: the embedding model is a versioned dependency. Vectors from model A and model B are not comparable, so adopting a new model means re-embedding the entire corpus and rebuilding the index. That is a migration you must design for (shadow index, dual-write, backfill, cutover) before you ship v1.
A real, runnable indexing slice: recursive chunking → bi-encoder embedding (open BGE) → cosine top-k. Honest about the prefix convention and normalization.
```python
# pip install sentence-transformers langchain-text-splitters numpy
import numpy as np
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter

DOC = """Contextual Retrieval prepends chunk-specific context before embedding.
Anthropic reported contextual embeddings plus contextual BM25 cut failed
retrievals by about 49 percent, and adding reranking reached about 67 percent.
A bi-encoder maps a span to one vector with the query and document encoded
independently. That independence is why you can precompute document vectors and
run approximate nearest-neighbor search at scale, and also why a cross-encoder
reranker, which sees query and document jointly, adds relevance on top."""

# Recursive splitter respects paragraph/sentence boundaries.
# NOTE: chunk_size counts CHARACTERS by default; use a tokenizer length_function
# for true token budgeting (e.g. .from_huggingface_tokenizer(...)).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=40,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(DOC)

# 384-dim open bi-encoder. bge-*-en-v1.5 wants a query INSTRUCTION for retrieval;
# passages are embedded raw. Omitting the instruction silently hurts recall.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
doc_emb = model.encode(chunks, normalize_embeddings=True)  # (N, 384), unit norm

query = "How much did contextual retrieval reduce failed retrievals?"
q_instr = "Represent this sentence for searching relevant passages: "
q_emb = model.encode([q_instr + query], normalize_embeddings=True)[0]

# Vectors are unit-normalized, so dot product == cosine similarity.
scores = doc_emb @ q_emb  # (N,)
topk = np.argsort(-scores)[:3]
for rank, i in enumerate(topk, start=1):
    print(f"{rank}. {scores[i]:.3f} {chunks[i][:80]!r}")
```

The code normalizes vectors (unit norm), so doc_emb @ q_emb (dot product) equals cosine. Let's work it out:

Query: [0.8, 0.6] (imagine just two dimensions). Chunk A: [0.9, 0.4] (approximately normalized). Chunk B: [0.2, 0.98] (approximately normalized).

Dot product for A: 0.8 * 0.9 + 0.6 * 0.4 = 0.72 + 0.24 = 0.96.

Dot product for B: 0.8 * 0.2 + 0.6 * 0.98 = 0.16 + 0.588 = 0.748.

Chunk A scores 0.96, chunk B scores 0.748; A is ranked first because its direction is more aligned with the query. This is the entire ranking mechanism: direction in vector space = relevance. The code does this as one matrix multiply (doc_emb @ q_emb), and a vector database at scale does the same with approximate nearest-neighbor tricks (like HNSW graphs) to avoid computing every distance.
What to notice. `normalize_embeddings=True` makes the dot product equal cosine, so the final scoring is one matrix multiply—the exact operation a vector DB's HNSW index approximates at scale. The BGE *query instruction* is load-bearing; the asymmetry between how you encode queries vs. passages is a real source of silent quality loss. This is the naive path on purpose: swap the in-memory `@` for a vector DB at scale, add contextual prefixes before `encode`, and add BM25 + a `bge-reranker`/`cohere` rerank pass on the top-50 to turn this into the advanced pipeline in [hybrid search and rerank](/rag/hybrid-search-and-rerank). Evaluate every change against a golden set with [RAGAS / retrieval metrics](/rag/evaluation)—do not eyeball it.
Embeddings place meaning in geometry: similar things sit close together. A query lands somewhere, and 'retrieval' is just finding its nearest neighbours — the engine under RAG and semantic search.
| Decision | Cheaper / faster | Higher quality | Failure mode if wrong |
|---|---|---|---|
| Chunk size | Larger chunks (fewer vectors, cheaper index) | Small + parent-child | Too large → averaged-out embeddings, wasted context, lost-in-the-middle; too small → context-starved chunks |
| Chunking strategy | Fixed-size | Structure-aware → contextual | Mid-sentence/mid-table cuts; orphaned facts |
| Embedding dims | Truncate via Matryoshka (e.g. 512) | Full dims (1024–3072) | Over-truncation drops recall; over-provisioning burns RAM and latency for nothing |
| Quantization | binary → int8 | float32 + rescoring | Recall cliff with no rescoring step |
| Index | IVF-PQ (low RAM) | HNSW (high recall/low latency) | OOM at scale, or slow build/updates; untuned ef_search tanks recall or latency |
| Model hosting | API (no ops) | Self-host open (no per-token cost, residency) | Vendor lock-in on a dependency you can't swap without a full re-index |
In prose: at small scale, almost nothing here matters—use recursive chunking, a solid API or open model at full dims, and pgvector; spend your effort on a golden set and a reranker. What changes at scale is that storage/RAM cost becomes the dominant term, so quantization and dimensionality (Matryoshka) move from "nice" to "required," and HNSW-vs-IVF-PQ becomes a real budget decision. The most expensive mistake is shipping without a re-embedding migration plan: the day a meaningfully better model lands (and one always does), you need to re-encode and rebuild the entire index without downtime—dual-write to a shadow index, backfill, validate recall parity on the golden set, then cut over. The second most expensive mistake is optimizing chunking by intuition instead of measuring recall@k and nDCG on real queries—chunking quality is empirical, and the optimum is corpus- and query-distribution-specific.
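Both metrics named above are short to compute for a single query. This is a minimal sketch assuming binary relevance labels (a chunk is either relevant or not) and hypothetical chunk ids; a golden-set evaluation would average these over all queries.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """nDCG with binary relevance: rewards relevant chunks ranked early."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# One golden-set entry: the ids a retriever returned vs. the labeled answers.
retrieved = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}

print(f"recall@3 = {recall_at_k(retrieved, relevant, 3):.2f}")
print(f"nDCG@3   = {ndcg_at_k(retrieved, relevant, 3):.2f}")
```

Recall@k only asks whether the right chunks made the cut, while nDCG also penalizes ranking them low, so tracking both separates "not retrieved" from "retrieved but buried".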
claude-haiku-4-5) with the full doc prompt-cached → embed with a Matryoshka multilingual bi-encoder → index dense (HNSW/IVF-PQ by scale) alongside contextual BM25 for hybrid. I separate retrieval eval from generation eval, gate changes in CI against a golden set (recall@k, nDCG, hit-rate), and never mix vectors across model versions: new models ship via dual-write to a shadow index, backfill, parity check, cutover. As the corpus grows into multi-hop "sense-making" queries that flat vector search can't answer, I'd layer in GraphRAG or agentic retrieval rather than keep over-tuning chunk size.

RecursiveCharacterTextSplitter's chunk_size is characters by default; budget in tokens with a tokenizer length function.

Flashcard. Chunking and embeddings set the retrieval ceiling: chunk on structure (then add contextual prefixes), embed with a bi-encoder chosen by nDCG on your own golden set (not MTEB rank), and tune dims/quantization/index along the recall↔latency↔memory frontier, because a bi-encoder commits to one lossy vector before it sees the query, and only good chunks plus hybrid + rerank can recover what it left out.
Next: Hybrid Search and Rerank. Once your chunks and vectors are solid, fuse dense with BM25 and sharpen with a cross-encoder to claw back what the bi-encoder dropped.