IC4 · IC5

Chunking and Embeddings

The ingest-side decisions—how you cut documents and how you turn text into vectors—silently cap your retrieval ceiling, which is why interviewers probe them before they ask about rerankers.

16 min read · 14 sections

Prerequisites: what a bi-encoder is, cosine similarity / dot product, /rag/foundations

1. Quick anchor

Chunking and embedding are the two ingest-time decisions that set the ceiling on everything downstream. Retrieval can only ever return spans you indexed, in the granularity you indexed them, encoded by a model that may or may not have understood them. A chunk is the atomic unit a retriever can return: too big and the embedding averages away the signal (and you waste context tokens on irrelevant text); too small and the chunk loses the context that made it meaningful. An embedding is a lossy projection of text into a vector where geometric closeness is supposed to mean semantic relevance—but a bi-encoder compresses a whole passage into one vector before it has seen the query, so anything it discarded is gone. Get these two right and a reranker has good candidates to sharpen; get them wrong and no reranker, query rewrite, or 1M-token window can recover what was never retrievable. This is why retrieval, not generation, is almost always the bottleneck.

2. Why interviewers probe this

This topic separates people who have tuned a RAG system from people who have only wired one together. The "default" pipeline (RecursiveCharacterTextSplitter(chunk_size=1000) + text-embedding-3-small + cosine) works in a demo and quietly fails in production, and interviewers want to see whether you know why and where.

At IC4 the tells are:

You can name and contrast concrete chunking strategies (fixed, recursive, structure-aware, parent-child) and say when each applies—not just "I chunk the docs."
You know an embedding model is a bi-encoder producing one vector per chunk, and that cosine ranks by direction; you don't conflate it with a cross-encoder/reranker.
You measure chunking and embedding choices with retrieval metrics (recall@k, nDCG) on a golden set rather than eyeballing answers.

At IC5/IC6 the tells are:

You reason about the cost/latency/quality/memory surface explicitly: dimensions vs. recall, HNSW memory vs. IVF-PQ, int8/binary quantization, Matryoshka truncation, re-embedding migrations.
You treat the embedding model as a versioned dependency with a migration story (you cannot mix vectors from two models in one index), and you know domain/multilingual fit beats leaderboard rank.
You connect ingest decisions to system behavior: why contextual retrieval helps, why "lost in the middle" makes returning giant parents risky, when to abandon flat vector search for graph/structure-aware approaches, and how you'd A/B a new chunker without a full re-index.

3. Concept build-up

Beginner explainer: new here?

The words first.

Chunk: one atomic piece of a document that the retriever can return (typically 200–500 tokens).
Embedding: a vector (a list of numbers) that represents the meaning of text; closer vectors = more similar meaning.
Bi-encoder: a model that encodes each document chunk independently into a vector before seeing any query.
Cosine similarity: how aligned two vectors are (ranges from -1 to 1; normalizing the vectors makes it equal to their dot product); the retriever ranks chunks by cosine with the query vector.
Quantization: shrinking vectors to use less memory (e.g., float32 → int8, 4× smaller) with a small quality hit.
Contextual retrieval: prepending a mini description of each chunk before embedding it, so the vector captures what document it's from and why it matters.

Step by step.

Take a document and split it into small pieces (chunks) using rules or structure—e.g., by paragraph or heading.
Feed each chunk into an embedding model; it produces one vector per chunk, independent of the query.
When a query comes in, embed it the same way, producing one query vector.
Compare the query vector to every chunk vector using cosine similarity—higher score = more relevant.
Rank and return the top chunks.
The retriever can only return chunks you indexed; if you indexed huge chunks, the generator wastes context tokens on noise.

Remember this: Chunking and embeddings are the two ingest decisions that set the ceiling on everything downstream. A retriever can only return what it indexed, in the granularity you chose, so get these two right and reranking can shine; get them wrong and no amount of fancy ranking later will recover what was never there.

3.1 First principles: what a chunk is, and why granularity is the whole game

A retriever returns units you chose at ingest. If your unit is a 4,000-token section, the smallest thing you can hand the generator is that whole section, even when the answer is one sentence—so you pay context tokens for noise and risk "lost in the middle," where models attend less reliably to material buried in the center of a long context. If your unit is a single sentence, you can pin the exact answer but you've stripped the surrounding context that disambiguates it ("It was discontinued in 2021"—what was?).

So chunking is a bias/variance tradeoff on information density per vector. A bi-encoder mean-pools (or CLS-pools) a span into one fixed vector; the more heterogeneous the span, the more that vector is a blurry average of several topics, and the lower its cosine similarity to any specific query. The practical sweet spot for prose is usually ~200–500 tokens with ~10–20% overlap, but the right answer depends on your query distribution: fact-lookup queries want small chunks, synthesis/"explain" queries want larger or hierarchical ones.

✎ Chunking granularity — on real numbers

Imagine you have a document discussing both "revenue recognition" and "tax accounting." You embed it as one chunk.

The embedding model compresses both topics into a single 384-d vector (one number per dimension).
A query asking "How do we recognize revenue?" becomes a separate 384-d vector.
Cosine similarity between them is lower than if you'd chunked the document into two topically separate chunks, because the chunk vector is a blurry average.
Illustrative numbers (not measured): the revenue chunk alone might score cosine_sim = 0.78 against this query and the tax chunk 0.15; embedded together as one big chunk, the vector might score only 0.55, diluting the revenue signal the query needed.

Conversely, if you chunk by every sentence, you get precise hits but lose context: "It was discontinued in 2021" may be a true sentence but is meaningless without knowing what "it" is.

The tradeoff: ~300 tokens with ~20% overlap is the usual sweet spot because it keeps facts that depend on each other in the same vector, but small enough that the vector is mostly signal, not noise from unrelated text.
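The "blurry average" effect can be reproduced with toy vectors. This is a minimal sketch, assuming two hypothetical 2-d topic directions and mean pooling as a stand-in for what a bi-encoder does to a mixed span; the vectors are made up for illustration and are not real model output.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical 2-d topic directions, not real model output.
revenue = [1.0, 0.0]
tax = [0.0, 1.0]
query = [0.98, 0.2]  # a query that is mostly about revenue

# One chunk covering both topics, approximated by mean-pooling the two.
mixed = [(r + t) / 2 for r, t in zip(revenue, tax)]

sim_revenue = cosine(query, revenue)
sim_mixed = cosine(query, mixed)
sim_tax = cosine(query, tax)
print(f"revenue={sim_revenue:.2f} mixed={sim_mixed:.2f} tax={sim_tax:.2f}")
```

The mixed chunk lands between the two topical chunks: it scores worse than the chunk the query actually wanted, which is the cost of packing unrelated topics into one vector.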

3.2 Chunking strategies, in increasing sophistication

Fixed-size splits every N tokens (or characters) with a fixed overlap. Simple, fast, and deterministic—but it cuts mid-sentence and mid-table, which is exactly where embeddings degrade. Use it only as a baseline.
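A minimal sketch of the fixed-size baseline, assuming whitespace-split words as a stand-in for real tokenizer tokens. The function name fixed_size_chunks is illustrative and not from any library.

```python
def fixed_size_chunks(tokens, size, overlap):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Whitespace words stand in for tokens (an assumption for the example).
tokens = "the refund window is thirty days from the delivery date".split()
chunks = fixed_size_chunks(tokens, size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Note how the windows ignore sentence structure entirely: the boundaries fall wherever the counter says, which is why this is a baseline rather than a default.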

Recursive (LangChain's RecursiveCharacterTextSplitter) tries a priority list of separators—["\n\n", "\n", ". ", " ", ""]—splitting on the coarsest boundary that keeps chunks under the size limit, then recursing. This respects paragraph/sentence structure far better than fixed-size and is the right default. Watch the unit: by default chunk_size counts characters, not tokens; for token budgeting use .from_huggingface_tokenizer(...) or a tiktoken-based length function.

Structure-aware splits on the document's own skeleton—Markdown headings, HTML tags, code AST nodes, PDF layout. A MarkdownHeaderTextSplitter keeps a section intact and attaches the heading path (# Billing > ## Refunds) as metadata. For code, splitting on function/class boundaries beats splitting on line count. This is the highest-leverage upgrade for structured corpora because it aligns chunk boundaries with semantic boundaries for free.
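To make the heading-path idea concrete, here is a simplified sketch of a Markdown section splitter written from scratch. It is not LangChain's MarkdownHeaderTextSplitter; the function name split_by_headings is hypothetical, and it assumes ATX-style headings and ignores edge cases such as "#" lines inside code fences.

```python
import re

def split_by_headings(markdown):
    """Return (heading_path, body) pairs, one per Markdown section."""
    path, body, sections = [], [], []

    def flush():
        # Emit the text collected so far under the current heading path.
        text = "\n".join(body).strip()
        if text:
            sections.append((" > ".join(path), text))
        body.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return sections

doc = """# Billing
Invoices are issued monthly.
## Refunds
Refunds are processed within 30 days.
## Disputes
Open a ticket to dispute a charge.
"""
sections = split_by_headings(doc)
for heading_path, text in sections:
    print(f"[{heading_path}] {text}")
```

Each section carries its heading path, so "Refunds are processed within 30 days." is stored alongside "Billing > Refunds" and can be filtered or prefixed with it later.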

Semantic chunking places boundaries where the embedding distance between consecutive sentences spikes—i.e., where the topic shifts. It produces topically coherent chunks but costs an embedding pass at ingest and is sensitive to the threshold. In practice it often underperforms a good structure-aware splitter while costing more, so reach for it when documents lack usable structure (long unstructured transcripts).
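The boundary rule can be sketched in a few lines. The sentence vectors below are hypothetical 2-d stand-ins for real sentence embeddings, and the threshold of 0.3 is an arbitrary value chosen for the toy data, not a recommended setting.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def semantic_boundaries(sentence_vectors, threshold):
    """Indices i where the distance between sentence i-1 and i exceeds threshold."""
    return [
        i for i in range(1, len(sentence_vectors))
        if cosine_distance(sentence_vectors[i - 1], sentence_vectors[i]) > threshold
    ]

def group_sentences(sentences, boundaries):
    """Join sentences into chunks, starting a new chunk at each boundary index."""
    chunks, start = [], 0
    for end in boundaries + [len(sentences)]:
        chunks.append(" ".join(sentences[start:end]))
        start = end
    return chunks

sentences = [
    "Refunds take 30 days.",
    "Refunds go to the original card.",
    "Shipping is free over $50.",
    "Shipping takes 3 days.",
]
# Hypothetical embeddings: the first two point one way, the last two another.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

boundaries = semantic_boundaries(vectors, threshold=0.3)
chunks = group_sentences(sentences, boundaries)
print(boundaries, chunks)
```

The threshold sensitivity mentioned above shows up directly here: set it too low and every sentence becomes its own chunk, too high and the topic shift is never detected.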

Parent-child / small-to-big decouples what you embed from what you return: embed small, precise child chunks (good retrieval signal), but return their larger parent (good generation context). You index children with a pointer to the parent and fetch the parent at read time. This is one of the best quality/effort trades in production RAG because it fixes the granularity dilemma directly—but mind context cost: returning huge parents reintroduces "lost in the middle," so cap parent size.
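The read-time half of parent-child retrieval might look like the sketch below. The ids, the dict-based storage, and the character cap are all assumptions for illustration; a real system would cap by tokens and fetch parents from a document store.

```python
def retrieve_parents(ranked_child_ids, child_to_parent, parents,
                     max_parents=2, max_chars=2000):
    """Map ranked child hits to unique, size-capped parents in rank order."""
    seen, results = set(), []
    for child_id in ranked_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id in seen:
            continue  # several children can point at the same parent
        seen.add(parent_id)
        results.append((parent_id, parents[parent_id][:max_chars]))  # crude size cap
        if len(results) == max_parents:
            break
    return results

# Hypothetical data: two parents, three child chunks.
parents = {
    "p1": "Refund policy: refunds are processed within 30 days of delivery.",
    "p2": "Shipping policy: orders over $50 ship free.",
}
child_to_parent = {"c1": "p1", "c2": "p1", "c3": "p2"}
ranked_children = ["c2", "c1", "c3"]  # as returned by the vector search

results = retrieve_parents(ranked_children, child_to_parent, parents)
print([parent_id for parent_id, _ in results])
```

Deduplicating on the parent id matters: without it, two sibling children would hand the generator the same parent twice.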

Late chunking (Jina, 2024) inverts the order: run the whole document through a long-context embedding model first, then mean-pool token embeddings into chunk vectors after the transformer has attended across the full document. Each chunk vector is thus conditioned on global context (pronouns and references resolve), with one model pass instead of an LLM call per chunk. It needs a long-context embedding model and is most useful when chunks are heavily cross-referential.
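The pooling step of late chunking can be sketched as follows, assuming the long-context model has already produced one contextualized vector per token. The token vectors and span boundaries here are hypothetical, and late_chunk_pool is an illustrative name rather than a library function.

```python
def late_chunk_pool(token_vectors, spans):
    """Mean-pool contextualized token vectors over each (start, end) token span."""
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled

# Hypothetical per-token outputs for a 5-token document (after full-document attention).
token_vectors = [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4], [0.0, 1.0], [0.2, 0.8]]
spans = [(0, 3), (3, 5)]  # chunk boundaries expressed as token offsets

chunk_vectors = late_chunk_pool(token_vectors, spans)
print(chunk_vectors)
```

The difference from ordinary chunking is only in where the token vectors come from: here each one was computed with the whole document in view, so the pooled chunk vector inherits that context.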

Overlap carries a tail of one chunk into the next so a fact straddling a boundary survives in at least one chunk. It's cheap insurance against boundary loss; the cost is index bloat and near-duplicate hits (dedupe at retrieval). Metadata is non-negotiable: attach source, section path, timestamps, and access-control tags to every chunk so you can pre-filter (tenant, recency, permissions) before or during the vector search. Filtering is often what makes retrieval correct, not just relevant.
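A sketch of metadata pre-filtering, assuming an in-memory list of chunk records with a hypothetical tenant field and unit-normalized vectors. A vector database would apply the same filter inside the index rather than in a list comprehension.

```python
def filtered_search(query_vec, chunks, tenant, top_k=3):
    """Score only chunks the caller may see; vectors are assumed unit-normalized."""
    allowed = [c for c in chunks if c["tenant"] == tenant]
    scored = [
        (sum(q * v for q, v in zip(query_vec, c["vector"])), c["id"])
        for c in allowed
    ]
    return sorted(scored, reverse=True)[:top_k]

# Hypothetical records: the best-matching chunk belongs to another tenant.
chunks = [
    {"id": "a1", "tenant": "acme", "vector": [1.0, 0.0]},
    {"id": "g1", "tenant": "globex", "vector": [0.8, 0.6]},
    {"id": "a2", "tenant": "acme", "vector": [0.6, 0.8]},
]
hits = filtered_search([0.8, 0.6], chunks, tenant="acme")
print(hits)
```

The globex chunk is an exact match for the query and would rank first by similarity alone, which is the point: the filter is what keeps the result correct, not just relevant.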

3.3 Contextual Retrieval: fixing the "chunk lost its context" problem

The deepest failure mode of independent chunks is that each is embedded with no knowledge of its document. Anthropic's Contextual Retrieval (Sept 2024) prepends a short, LLM-generated, chunk-specific blurb situating the chunk in its document ("This chunk is from the Q3 2023 ACME 10-Q, discussing revenue recognition...") before embedding and before BM25 indexing. Per their published results, contextual embeddings + contextual BM25 reduced failed retrievals by ~49%, and adding a reranker pushed it to ~67%. The per-chunk LLM cost—you call a cheap model like claude-haiku-4-5 once per chunk—is made affordable by prompt caching the full document across all of its chunks. This is a chunking-and-embedding technique (it changes what text gets vectorized), and it composes with hybrid search and rerank.
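The assembly step can be sketched without committing to any particular LLM client. In this sketch generate_context is a hypothetical callable standing in for the per-chunk model call, and fake_generate_context is a stub so the example runs offline; neither reflects Anthropic's actual prompt or API.

```python
def contextualize(document, chunks, generate_context):
    """Prepend a situating blurb to each chunk; the result is what gets indexed."""
    contextualized = []
    for chunk in chunks:
        blurb = generate_context(document=document, chunk=chunk)
        contextualized.append(f"{blurb}\n\n{chunk}")
    return contextualized

def fake_generate_context(document, chunk):
    # Stand-in for the LLM call. A real implementation would send the full
    # document (prompt-cached) plus the chunk and ask for a short blurb.
    title = document.splitlines()[0]
    return f"This chunk is from {title}."

document = (
    "ACME 10-Q, Q3 2023\n"
    "Revenue is recognized when control transfers.\n"
    "Tax is accrued quarterly."
)
chunks = ["Revenue is recognized when control transfers.", "Tax is accrued quarterly."]

contextualized = contextualize(document, chunks, fake_generate_context)
print(contextualized[0])
```

The same contextualized strings would feed both the embedding model and the BM25 index, so the added context helps dense and lexical retrieval alike.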

3.4 How embedding models actually work

An embedding model for retrieval is a bi-encoder: a transformer that maps a text span to a single dense vector, with query and document encoded independently. You compare with cosine similarity (or dot product, which gives the same ranking when vectors are L2-normalized). Independence is the source of both its strength and its weakness. It is what lets you precompute every document vector once and run approximate nearest-neighbor search over millions of them in milliseconds. But because the encoder never sees the query and document together, it must guess at ingest what will matter, and it cannot do the token-level cross-attention that a cross-encoder reranker does. That is the architectural reason a reranker typically adds relevance on top of dense retrieval (gains on the order of 10–30% are commonly cited, though the size depends on the corpus and metric).

Models are trained with contrastive objectives (pull matched query/passage pairs together, push mismatched ones apart), often with hard negatives. Several leading open models—the E5 and BGE families—use asymmetric prefixes: you must prepend query: / passage: (E5) or a search instruction (BGE) or you silently lose quality. Forgetting the prefix is one of the most common, hardest-to-spot bugs in a homegrown pipeline.
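One way to make the prefix hard to forget is to route every encode call through a small helper. This is a sketch for the E5 convention only (query: / passage:); the helper name e5_inputs is hypothetical, and other model families use different instructions, so check the model card before reusing it.

```python
def e5_inputs(texts, kind):
    """Add the E5-style role prefix; `kind` must be 'query' or 'passage'."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text}" for text in texts]

queries = e5_inputs(["how long is the refund window"], "query")
passages = e5_inputs(["Refunds are processed within 30 days."], "passage")
print(queries[0])
print(passages[0])
```

Raising on an unknown kind turns a silent quality bug into a loud one at the call site.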

3.5 Choosing an embedding model

Treat MTEB (the Massive Text Embedding Benchmark) as a starting filter, not gospel—leaderboard rank is dominated by general web text and is gameable; your corpus (legal, code, biomedical, multilingual) may rank models completely differently. The selection axes that matter in 2026:

Quality on your data. Build a golden set from real queries and labeled relevant chunks and measure nDCG@10 / recall@k on it. This number, not MTEB, decides.
Dimensionality. OpenAI text-embedding-3-large is 3072-d, -small is 1536-d; open models like BGE-M3 are 1024-d, E5/BGE-small are 384-d. More dimensions can mean marginally better recall but linearly more storage and RAM and slower distance math.
Matryoshka (MRL). Matryoshka Representation Learning trains a single model whose prefix of dimensions is itself a usable embedding, so you can truncate 3072→512 and renormalize with graceful quality loss—no re-embedding. text-embedding-3-* and several open models support this. It's the cleanest knob for trading recall against cost.
Multilingual / domain. For multilingual corpora, BGE-M3 is a strong open default: it produces dense, sparse (lexical), and multi-vector (ColBERT-style) representations from one model, and handles 100+ languages with long inputs. For English-only work, the E5/BGE/GTE/Nomic families are competitive and cheap to self-host.
Open vs. API. API models (OpenAI, Cohere embed v3/v4, Voyage) mean zero ops and good quality; self-hosted open models (BGE, E5) mean no per-token cost, data residency, and no vendor lock-in on a dependency you can never silently swap.
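The Matryoshka knob from the list above is just "slice, then renormalize". A minimal sketch, assuming the vector came from an MRL-trained model; on a model without that training, truncation can discard information arbitrarily. The 4-d vector is a toy stand-in.

```python
import math

def truncate_and_renormalize(vector, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]  # toy unit-length "embedding"
short = truncate_and_renormalize(full, 2)
print(short)
```

Renormalizing matters because the truncated prefix is no longer unit length, and a dot product on unnormalized vectors is no longer cosine.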

3.6 Quantization and the index: making it scale

At small scale (≤1M chunks) you can keep float32 vectors in pgvector or LanceDB and not think hard. At 10M–1B chunks, memory dominates cost and you quantize:

int8 stores each dimension as a byte (4× smaller than fp32) with a small recall hit—usually the first thing to reach for. Binary (1 bit/dim, Hamming distance) is ~32× smaller and blazingly fast, used as a coarse first pass that you rescore with full-precision vectors on the top candidates. Cohere's embeddings ship int8/binary variants designed for exactly this.

✎ Quantization tradeoff — on real numbers

You have 100M chunks at 1024 dimensions each.

Storage at float32: 100M * 1024 * 4 bytes = 409.6 GB.
Storage at int8: 100M * 1024 * 1 byte = 102.4 GB (4× savings).
Storage at binary (1 bit/dim): 100M * 1024 * 0.125 bytes = 12.8 GB (32× savings).

What you lose: float32 is precise. int8 rounds each dimension to one of 256 levels, so two chunks that differed slightly in float32 can become indistinguishable in int8, and similarity scores shift slightly (typically a ~1–3% recall hit). Binary is rougher: Hamming distance on 1-bit vectors is fast but fuzzy.

The technique is rescoring: use the small, fast binary index to find the top-500 candidates, then recompute distances on the original float32 vectors to rank them precisely. You win: a 32× smaller index and a very fast first pass. You lose: extra work on the second pass, and the float32 vectors still have to be kept somewhere for that rescoring.
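The storage arithmetic and the binary-then-rescore pattern can be sketched together. The 4-d vectors are toy values, sign-bit binarization is one simple scheme among several, and the function names are illustrative rather than taken from a vector database.

```python
def index_size_gb(n_vectors, dims, bytes_per_dim):
    """Raw vector storage in decimal gigabytes (no index overhead)."""
    return n_vectors * dims * bytes_per_dim / 1e9

def to_binary(vector):
    """1 bit per dimension: keep only the sign."""
    return [1 if x > 0 else 0 for x in vector]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def binary_then_rescore(query, docs, shortlist=2, top_k=1):
    """Coarse pass on sign bits, then exact dot product on the shortlist."""
    q_bits = to_binary(query)
    coarse = sorted(range(len(docs)), key=lambda i: hamming(q_bits, to_binary(docs[i])))
    candidates = coarse[:shortlist]
    exact = sorted(candidates, key=lambda i: -sum(q * d for q, d in zip(query, docs[i])))
    return exact[:top_k]

sizes = {
    "float32": index_size_gb(100_000_000, 1024, 4),
    "int8": index_size_gb(100_000_000, 1024, 1),
    "binary": index_size_gb(100_000_000, 1024, 0.125),
}
print(sizes)

# Toy vectors: docs 0 and 1 have identical sign bits, so only rescoring separates them.
query = [0.9, 0.1, -0.3, 0.3]
docs = [
    [0.8, 0.2, -0.4, 0.4],
    [0.1, 0.9, -0.1, 0.4],
    [-0.7, -0.2, 0.6, -0.3],
]
best = binary_then_rescore(query, docs)
print(best)
```

The first two documents tie at Hamming distance 0 in the binary pass, which is the "fast but fuzzy" behavior described above; the full-precision pass on the shortlist is what breaks the tie correctly.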

Index structure is the other big lever. HNSW (Hierarchical Navigable Small World graphs) gives excellent recall and low latency but holds the graph in RAM—memory-hungry and slow to build/update. IVF-PQ (inverted file + product quantization) compresses vectors aggressively for huge corpora with much lower memory, at some recall cost and tuning effort. The universal tradeoff: recall ↔ latency ↔ memory—pick two. Tune HNSW's ef_search (and M) to move along the recall/latency curve. Vector DBs in play: pgvector, Qdrant, Weaviate, Milvus, LanceDB, Pinecone, Turbopuffer.

The staff-level point: the embedding model is a versioned dependency. Vectors from model A and model B are not comparable, so adopting a new model means re-embedding the entire corpus and rebuilding the index—a migration you must design for (shadow index, dual-write, backfill, cutover) before you ship v1.
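One cheap guard that supports this migration story is to tag every index with the model that produced its vectors and reject anything else at write time. The class below is a toy in-memory sketch; the name VersionedIndex and the model ids embed-v1 / embed-v2 are hypothetical.

```python
class VersionedIndex:
    """Toy in-memory index that rejects vectors from a different model."""

    def __init__(self, model_id, dims):
        self.model_id = model_id
        self.dims = dims
        self.vectors = {}

    def add(self, chunk_id, vector, model_id):
        if model_id != self.model_id:
            raise ValueError(f"index holds {self.model_id!r} vectors, got {model_id!r}")
        if len(vector) != self.dims:
            raise ValueError(f"expected {self.dims} dims, got {len(vector)}")
        self.vectors[chunk_id] = vector

# Dual-write during a migration: each index only ever sees its own model's vectors.
live = VersionedIndex("embed-v1", dims=2)
shadow = VersionedIndex("embed-v2", dims=2)
live.add("c1", [1.0, 0.0], model_id="embed-v1")
shadow.add("c1", [0.0, 1.0], model_id="embed-v2")

try:
    live.add("c2", [0.0, 1.0], model_id="embed-v2")  # wrong model for this index
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
print(mixed_rejected)
```

The dimension check alone is not enough, since two different models can share a dimensionality; the explicit model id is what prevents a silent mix.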

4. Minimal implementation

A real, runnable indexing slice: recursive chunking → bi-encoder embedding (open BGE) → cosine top-k. Honest about the prefix convention and normalization.

# pip install sentence-transformers langchain-text-splitters numpy
import numpy as np
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter
 
DOC = """Contextual Retrieval prepends chunk-specific context before embedding.
Anthropic reported contextual embeddings plus contextual BM25 cut failed
retrievals by about 49 percent, and adding reranking reached about 67 percent.

A bi-encoder maps a span to one vector with the query and document encoded
independently. That independence is why you can precompute document vectors and
run approximate nearest-neighbor search at scale, and also why a cross-encoder
reranker, which sees query and document jointly, adds relevance on top."""
 
# Recursive splitter respects paragraph/sentence boundaries.
# NOTE: chunk_size counts CHARACTERS by default; use a tokenizer length_function
# for true token budgeting (e.g. .from_huggingface_tokenizer(...)).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=40,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(DOC)
 
# 384-dim open bi-encoder. bge-*-en-v1.5 wants a query INSTRUCTION for retrieval;
# passages are embedded raw. Omitting the instruction silently hurts recall.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
doc_emb = model.encode(chunks, normalize_embeddings=True)  # (N, 384), unit norm
 
query = "How much did contextual retrieval reduce failed retrievals?"
q_instr = "Represent this sentence for searching relevant passages: "
q_emb = model.encode([q_instr + query], normalize_embeddings=True)[0]
 
# Vectors are unit-normalized, so dot product == cosine similarity.
scores = doc_emb @ q_emb  # (N,)
topk = np.argsort(-scores)[:3]
for rank, i in enumerate(topk, start=1):
    print(f"{rank}. {scores[i]:.3f} {chunks[i][:80]!r}")

✎ Cosine similarity on real numbers

The code normalizes vectors (unit norm), so doc_emb @ q_emb (dot product) equals cosine. Let's work it out:

Query vector (normalized): [0.8, 0.6] (imagine just two dimensions).
Chunk vector A: [0.6, 0.8] (normalized).
Chunk vector B: [0.2, 0.98] (normalized to within rounding).

Dot product for A: 0.8 * 0.6 + 0.6 * 0.8 = 0.48 + 0.48 = 0.96.
Dot product for B: 0.8 * 0.2 + 0.6 * 0.98 = 0.16 + 0.588 = 0.748.

Chunk A scores 0.96, chunk B scores 0.748, so A is ranked first because its direction is more aligned with the query. This is the entire ranking mechanism: direction in vector space = relevance. The code does this as one matrix multiply (doc_emb @ q_emb), and a vector database at scale does the same with approximate nearest-neighbor tricks (like HNSW graphs) to avoid computing every distance.

 
What to notice. normalize_embeddings=True makes the dot product equal cosine, so the final scoring is one matrix multiply, the exact operation a vector DB's HNSW index approximates at scale. The BGE query instruction is load-bearing; the asymmetry between how you encode queries vs. passages is a real source of silent quality loss. This is the naive path on purpose: swap the in-memory @ for a vector DB at scale, add contextual prefixes before encode, and add BM25 + a bge-reranker/cohere rerank pass on the top-50 to turn this into the advanced pipeline in hybrid search and rerank (/rag/hybrid-search-and-rerank). Evaluate every change against a golden set with RAGAS / retrieval metrics (/rag/evaluation); do not eyeball it.

◇ Live illustration: The embedding space

Embeddings place meaning in geometry: similar things sit close together. A query lands somewhere, and 'retrieval' is just finding its nearest neighbours — the engine under RAG and semantic search.

5. Production tradeoffs

Decision	Cheaper / faster	Higher quality	Failure mode if wrong
Chunk size	Larger chunks (fewer vectors, cheaper index)	Small + parent-child	Too large → averaged-out embeddings, wasted context, lost-in-the-middle; too small → context-starved chunks
Chunking strategy	Fixed-size	Structure-aware → contextual	Mid-sentence/mid-table cuts; orphaned facts
Embedding dims	Truncate via Matryoshka (e.g. 512)	Full dims (1024–3072)	Over-truncation drops recall; over-provisioning burns RAM and latency for nothing
Quantization	binary → int8	float32 + rescoring	Recall cliff with no rescoring step
Index	IVF-PQ (low RAM)	HNSW (high recall/low latency)	OOM at scale, or slow build/updates; untuned ef_search tanks recall or latency
Model hosting	API (no ops)	Self-host open (no per-token cost, residency)	Vendor lock-in on a dependency you can't swap without a full re-index

In prose: at small scale, almost nothing here matters—use recursive chunking, a solid API or open model at full dims, and pgvector; spend your effort on a golden set and a reranker. What changes at scale is that storage/RAM cost becomes the dominant term, so quantization and dimensionality (Matryoshka) move from "nice" to "required," and HNSW-vs-IVF-PQ becomes a real budget decision. The most expensive mistake is shipping without a re-embedding migration plan: the day a meaningfully better model lands (and one always does), you need to re-encode and rebuild the entire index without downtime—dual-write to a shadow index, backfill, validate recall parity on the golden set, then cut over. The second most expensive mistake is optimizing chunking by intuition instead of measuring recall@k and nDCG on real queries—chunking quality is empirical, and the optimum is corpus- and query-distribution-specific.
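Since the advice keeps coming back to "measure recall@k and nDCG on a golden set", here is a minimal sketch of both metrics with binary relevance labels. The chunk ids are hypothetical, and a production evaluation would average these over every query in the golden set.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk ids that appear in the top k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    rel = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in rel
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(rel), k) + 1))
    return dcg / ideal

# One golden-set entry: what the retriever returned vs. what was labeled relevant.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = ["c2", "c4", "c5"]

recall = recall_at_k(retrieved, relevant, k=3)
ndcg = ndcg_at_k(retrieved, relevant, k=4)
print(f"recall@3={recall:.3f} nDCG@4={ndcg:.3f}")
```

Recall tells you whether the right chunks were found at all; nDCG also penalizes finding them late, which matters when only the top few results reach the generator.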

6. How it's asked

[IC3] Why chunk at all? Why not embed whole documents?
Because the retriever can only return the unit you indexed, and a whole-document vector is a blurry average of every topic in it: low similarity to any specific query, and if returned it dumps thousands of irrelevant tokens into the context (cost + lost-in-the-middle). Chunking lets you return precise, topically coherent spans. The cost is that a chunk can lose the context that disambiguated it, which is why overlap, structure-aware splitting, parent-child, and contextual retrieval exist.

[IC4] Walk me through the chunking strategies you'd consider and how you'd pick one.
Start with recursive splitting (respects paragraph/sentence boundaries) as the default. If the corpus has structure (Markdown, HTML, code), switch to structure-aware splitting and carry the heading path as metadata; that's the biggest free win. If retrieval is precise but generations lack context, go parent-child: embed small children, return capped parents. For unstructured transcripts, consider semantic chunking. I'd validate each on a golden set with recall@k and nDCG, because the right size depends on the query distribution: fact lookups want small chunks, synthesis wants larger or hierarchical ones.

[IC5] Pick an embedding model and dimensionality for a 100M-chunk multilingual corpus, and explain the index.
I'd shortlist with MTEB but decide on nDCG@10 over a golden set of real queries, because leaderboard rank doesn't capture domain or language fit. For multilingual I'd default to BGE-M3 (dense + sparse + multi-vector, 100+ languages, long inputs) or a strong API model like Cohere embed. At 100M chunks, RAM dominates: I'd use a Matryoshka-capable model and truncate to ~512–1024 dims, store int8 (4× smaller) with a float rescoring pass on the top candidates, and choose the index by budget: HNSW if I can afford the RAM for low-latency high recall, IVF-PQ if I can't. And I'd treat the model as a versioned dependency with a shadow-index re-embedding plan from day one.

[IC6] Design the indexing half of a retrieval system and how it evolves.
Pipeline: structure-aware chunking with metadata (source, section path, timestamps, ACL tags) → contextual-retrieval prefixes generated by a cheap model (claude-haiku-4-5) with the full doc prompt-cached → embed with a Matryoshka multilingual bi-encoder → index dense (HNSW/IVF-PQ by scale) alongside contextual BM25 for hybrid. I separate retrieval eval from generation eval, gate changes in CI against a golden set (recall@k, nDCG, hit-rate), and never mix vectors across model versions; new models ship via dual-write to a shadow index, backfill, parity check, cutover. As the corpus grows into multi-hop "sense-making" queries that flat vector search can't answer, I'd layer in GraphRAG or agentic retrieval rather than keep over-tuning chunk size.

[IC5] Your dense retrieval misses exact product codes and rare identifiers. What's happening and what do you do?
Dense bi-encoders embed meaning, so rare exact tokens (SKUs, error codes, function names) get washed out: the vector has no special mass on a string it treated as near-OOV. The fix isn't a bigger embedding model; it's hybrid search: add BM25/sparse retrieval, which excels at exact and rare-term matches, and fuse with Reciprocal Rank Fusion. Contextual retrieval and a reranker further help, but the root cause is asking one bi-encoder to do a job sparse retrieval does for free.
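Reciprocal Rank Fusion, named in the last answer, is short enough to sketch in full. It scores each document as the sum of 1 / (k + rank) across the input rankings; k = 60 is the commonly used constant, and the document ids here are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Dense retrieval misses the exact SKU; sparse (BM25) retrieval ranks it first.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["sku_123", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Because RRF uses only ranks, it needs no score calibration between BM25 and cosine, which is a large part of why it is a popular default for hybrid search.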

7. Pitfalls & flashcards

Chunking by intuition. "1000 chars felt fine" is not a decision. Measure recall@k/nDCG on a golden set; the optimum is corpus- and query-specific.
Forgetting the query/passage prefix on E5/BGE models—silent, large quality loss with no error.
Counting characters as tokens. RecursiveCharacterTextSplitter's chunk_size is characters by default; budget in tokens with a tokenizer length function.
Returning giant parents. Parent-child fixes granularity but uncapped parents reintroduce lost-in-the-middle and waste context tokens.
Embedding chunks with no document context. Independent chunks lose references; contextual retrieval (or late chunking) is the fix.
No migration plan. Vectors from two models aren't comparable; swapping models means a full re-embed + index rebuild. Design for it before v1.
Skipping metadata / ACL filters. Relevance without permission/recency filtering returns wrong answers, confidently.
Treating dense as a superset of sparse. Dense misses exact/rare tokens; sparse misses paraphrase. You need both.

Flashcard. Chunking and embeddings set the retrieval ceiling: chunk on structure (then add contextual prefixes), embed with a bi-encoder chosen by nDCG on your own golden set (not MTEB rank), and tune dims/quantization/index along the recall↔latency↔memory frontier—because a bi-encoder commits to one lossy vector before it sees the query, and only good chunks plus hybrid + rerank can recover what it left out.

8. Further reading

Anthropic, Introducing Contextual Retrieval — the ~49% / ~67% failed-retrieval reductions and the prompt-caching cost trick. https://www.anthropic.com/news/contextual-retrieval
MTEB: Massive Text Embedding Benchmark (leaderboard) — selection starting point, not the final word. https://huggingface.co/spaces/mteb/leaderboard
Kusupati et al., Matryoshka Representation Learning — truncatable embeddings for cost/recall trades. https://arxiv.org/abs/2205.13147
Chen et al., BGE-M3 — multilingual model producing dense + sparse + multi-vector outputs. https://arxiv.org/abs/2402.03216
Günther et al., Late Chunking — pool token embeddings after full-document attention. https://arxiv.org/abs/2409.04701
Malkov & Yashunin, HNSW — the graph index behind most low-latency vector search. https://arxiv.org/abs/1603.09320

Next: Hybrid Search and Rerank → — once your chunks and vectors are solid, fuse dense with BM25 and sharpen with a cross-encoder to claw back what the bi-encoder dropped.
