RAG & Retrieval
IC4 · IC5 · IC6

Hybrid Search and Reranking

Dense embeddings paraphrase well but fumble exact tokens; BM25 nails exact tokens but is blind to paraphrase — fuse them with RRF, then let a cross-encoder rerank the survivors. It's the single highest-ROI retrieval upgrade, and interviewers use it to see if you reason about recall vs. precision as separate problems.

16 min read · 13 sections
Prerequisites: /rag/foundations, /rag/chunking-and-embeddings, cosine similarity / bi-encoders, inverted indexes

1. Quick anchor

A dense bi-encoder crushes paraphrase ("how do I cancel my plan" → "subscription termination") but quietly fails on the things that don't survive being squashed into a 1,024-dim vector: exact identifiers, SKUs, error codes, rare proper nouns, function names. Sparse retrieval (BM25) is the mirror image — it matches E-4012 and pd.merge exactly but is blind to "the thing where rows get combined." The staff-level move is to stop treating retrieval as one ranking and treat it as two stages with different jobs: a high-recall candidate stage that runs dense and sparse in parallel and fuses them, then a high-precision reranking stage where a cross-encoder reads each query–document pair jointly and reorders the top ~50–100 down to the ~5 you actually stuff into context. Recall is the cheap-to-fix bottleneck; precision is what the generator sees. This lesson is about building that two-stage pipeline and defending every knob in it. See /rag/foundations for where this sits in the overall pipeline.

2. Why interviewers probe this

Hybrid + rerank is the most reliable retrieval-quality upgrade in production RAG, and it's also where weak candidates reveal they've only ever called .similarity_search(). The question separates people who think "RAG = cosine over a vector DB" from people who think in terms of recall/precision tradeoffs, score calibration, and latency budgets.

Tells at IC4 (can build it):

  • Knows why dense misses exact tokens and can name the failure (it's lossy compression, not a bug).
  • Can wire up BM25 + dense and combine the results without hand-waving.
  • Understands "retrieve many, rerank few" as a pattern.

Tells at IC5 (can reason about it):

  • Reaches for RRF and can write the formula; explains why rank-fusion beats score-normalization.
  • Distinguishes bi-encoder (precompute, ANN-searchable) from cross-encoder (joint, not indexable) and knows why you can't just rerank everything.
  • Picks reranker depth (top-50 vs top-100) and top-k as deliberate quality/latency knobs, and evaluates retrieval separately from generation (/rag/evaluation).

Tells at IC6 (can run it at scale):

  • Treats the reranker as a capacity-planning problem: batching, GPU vs API, caching, when to skip it.
  • Knows ColBERT/late interaction as the middle path and can articulate the quality/latency/storage tradeoff vs. a cross-encoder.
  • Talks about failure modes — over-trusting reranker scores, domain mismatch, multilingual gaps — and ties retrieval choices to product SLAs and cost-per-query.

3. Concept build-up

Beginner explainer: new here?

The words first.

  • Passage — a small chunk of text (a paragraph or two) that search returns as a candidate answer.
  • BM25 (sparse keyword search) — ranks passages by how many of your query's exact words they contain, weighting rare words more. "Sparse" because each passage only touches a few of all possible words.
  • Embedding — a list of numbers that captures a text's meaning; texts that mean similar things get numerically nearby lists, even with zero shared words.
  • Dense semantic search — finds passages whose embedding is closest to the query's embedding, so it matches meaning, not spelling.
  • Bi-encoder — the model that turns query and passage into embeddings separately, so passage vectors are precomputed. Fast, but it never sees the two together.
  • Reciprocal Rank Fusion (RRF) — a recipe for merging two ranked lists using each item's position, not its raw score.
  • Cross-encoder (reranker) — a model that reads the query and one passage together and outputs one relevance score. Accurate but slow, so it only scores a shortlist.

Step by step.

  1. Index every passage two ways: a keyword index for BM25 and a vector index of embeddings.
  2. Run both on the query; each returns its own ranked top-k list.
  3. See the gaps: BM25 misses paraphrases (car vs automobile); dense misses rare exact tokens (a product code, an odd name).
  4. Merge with RRF: score each passage 1/(60 + rank) in each list, add across both, re-sort. Anything either method liked rises.
  5. Keep the merged top ~50 as a shortlist.
  6. Rerank: run the cross-encoder on each (query, passage) pair for a precise score, then sort by it.
  7. Hand the top few to the LLM.

Remember this: two cheap, complementary nets (keywords + meaning) get blended by RRF, then a slow, precise cross-encoder polishes the final order.

3.1 Why dense alone fails — and why sparse alone fails

A dense bi-encoder maps query and document independently into the same vector space; you precompute document vectors once, index them in HNSW/IVF, and at query time do an approximate-nearest-neighbor (ANN) lookup. That independence is what makes it fast and scalable — and also what makes it lossy. A 1,024-dimensional float vector is a summary. It captures topical and semantic similarity beautifully, which is exactly why it generalizes across paraphrase. But low-frequency, high-information tokens — E-4012, CVE-2026-1337, quux_handler, a part number, a rare surname — get averaged into the gestalt and lose their discriminative power. The model has often never seen the token, so it lands somewhere generic. Result: the user types the one string that uniquely identifies the answer, and dense retrieval hands back plausible-but-wrong neighbors.

Sparse retrieval is the opposite. BM25 scores a document by how many query terms it contains, weighted by how rare each term is and dampened by document length. It operates on an inverted index — token → list of docs containing it — so an exact match on a rare token is a direct hit. BM25 is exact-match semantics. Its weakness is equally structural: it has no notion that "terminate subscription" and "cancel my plan" mean the same thing. Zero lexical overlap → zero score. Stemming and synonym lists patch a little of this, brittly.
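To make the inverted-index point concrete, here is a minimal sketch in plain Python. The three-document corpus, the whitespace tokenizer, and the helper names `build_inverted_index` and `lookup` are illustrative assumptions, not any library's API. It shows a rare token resolving to a direct hit and a paraphrase with zero lexical overlap returning nothing.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

# Toy corpus (assumed for illustration only).
docs = [
    "error E-4012 means the card was declined",
    "to cancel your plan open billing",
    "subscription termination removes access",
]
index = build_inverted_index(docs)

def lookup(query):
    """Return ids of docs sharing at least one token with the query."""
    hits = set()
    for token in query.lower().split():
        hits |= index.get(token, set())
    return sorted(hits)

exact_hits = lookup("E-4012")                 # rare identifier: direct hit
paraphrase_hits = lookup("end my membership")  # no shared tokens: nothing found
print(exact_hits, paraphrase_hits)
```

The second query is about cancelling, yet it shares no token with the cancel document, so a purely lexical lookup cannot surface it. That is the gap the dense retriever covers.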

So the two methods fail on disjoint inputs. That's the whole argument for hybrid: their errors are uncorrelated, so fusing them recovers far more than either alone. This isn't folklore — Anthropic's Contextual Retrieval post reports that combining contextual embeddings with contextual BM25 cut the failed-retrieval rate by ~49% versus embeddings alone, and ~67% once reranking was added on top.

3.2 BM25 mechanics (know this cold)

BM25 scores a query $Q$ against a document $D$ as a sum over query terms:

$$ \text{score}(D, Q) = \sum_{t \in Q} \text{IDF}(t) \cdot \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} $$

The pieces, and why each exists:

  • $f(t, D)$: term frequency. More occurrences → higher score, but with saturation: the $\frac{f(k_1+1)}{f + k_1(\dots)}$ shape means the 10th occurrence of a word adds far less than the 2nd. $k_1$ (typically 1.2–2.0) controls how fast it saturates. This is BM25's key advantage over raw TF-IDF.
  • $\text{IDF}(t)$: inverse document frequency, $\approx \log\frac{N}{n_t}$, where $N$ is the number of documents and $n_t$ is the number that contain $t$. Rare terms carry more weight; "the" contributes almost nothing, E-4012 contributes a lot. This is precisely why BM25 nails identifiers.
  • $b$ (≈0.75): length normalization strength. Without it, long documents win by sheer surface area. $\frac{|D|}{\text{avgdl}}$ normalizes by length relative to the corpus average.

BM25 saturation on real numbers

Here's the key piece of BM25, the term-frequency saturation. The formula is $\frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot (\dots)}$ where $f(t,D)$ is how many times term $t$ appears in document $D$, and $k_1$ (let's use 1.5) controls saturation.

Take a document of exactly average length, so that $1 - b + b \cdot \frac{|D|}{\text{avgdl}} = 1$ and the denominator reduces to $f(t,D) + k_1$. Saturation then looks like this:

  • If a word appears 1 time: $\frac{1 \cdot 2.5}{1 + 1.5} = \frac{2.5}{2.5} = 1.0$
  • If it appears 2 times: $\frac{2 \cdot 2.5}{2 + 1.5} = \frac{5.0}{3.5} \approx 1.43$
  • If it appears 5 times: $\frac{5 \cdot 2.5}{5 + 1.5} = \frac{12.5}{6.5} \approx 1.92$
  • If it appears 10 times: $\frac{10 \cdot 2.5}{10 + 1.5} = \frac{25.0}{11.5} \approx 2.17$

Notice: jumping from 1 to 2 occurrences adds 0.43 points. Jumping from 5 to 10 adds only 0.25. That's saturation: the curve flattens, so the 10th mention of a word barely helps (which is good). This is why BM25 beats raw TF-IDF: it refuses to let frequently repeated words dominate.

You don't tune $k_1$/$b$ much in practice — defaults are fine — but you must be able to explain that BM25 is term-frequency × rarity × length-normalization, and that its exactness is the feature.
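The formula above can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the `rank_bm25` source: the IDF here is the smoothed variant $\log\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)$ (one common choice that stays positive) rather than the plain $\log\frac{N}{n_t}$ approximation, and the toy tokenized corpus is made up.

```python
import math
from collections import Counter

def tf_saturation(f, k1=1.5, length_norm=1.0):
    """Saturating term-frequency factor; length_norm=1.0 means an average-length doc."""
    return f * (k1 + 1) / (f + k1 * length_norm)

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with BM25."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    df = Counter()                      # document frequency per token
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        length_norm = 1 - b + b * len(d) / avgdl
        score = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue                # zero lexical overlap contributes nothing
            # Assumption: smoothed IDF variant, always positive.
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf_saturation(tf[t], k1, length_norm)
        scores.append(score)
    return scores

# Toy tokenized corpus (assumed for illustration).
docs_tokens = [
    ["error", "e-4012", "card", "declined"],
    ["reset", "e-4012", "under", "payment", "methods"],
    ["refund", "policy", "thirty", "days"],
]
scores = bm25_scores(["e-4012"], docs_tokens)
print(scores)
print([round(tf_saturation(f), 2) for f in (1, 2, 5, 10)])
```

Both documents containing the identifier score above zero, the shorter one slightly higher because of length normalization, and the refund document scores exactly zero. The second print reproduces the saturation values worked out by hand above.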

3.3 Fusing the two: Reciprocal Rank Fusion

Now you have two ranked lists. How do you merge them? The naive approach — min-max normalize each score and add — is a trap, and saying so out loud is an IC5 tell. Cosine similarities and BM25 scores live on incomparable, non-stationary scales; BM25 is unbounded and query-dependent, cosine sits in a narrow band. Any fixed normalization is fragile to outliers and shifts query to query.
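A small sketch of why fixed normalization is fragile. The BM25 score lists below are invented for illustration; the point is only that one outlier at the top changes every other document's normalized score while leaving the rank order untouched.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; returns all zeros if every score is equal."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def rank_order(scores):
    """Indices of the documents, best score first."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Hypothetical BM25 scores for the same four docs under two queries.
bm25_typical = [12.0, 11.0, 10.0, 2.0]
bm25_outlier = [40.0, 11.0, 10.0, 2.0]   # one very strong exact match on top

norm_typical = min_max(bm25_typical)
norm_outlier = min_max(bm25_outlier)

# The second doc has raw score 11.0 in both cases, but its normalized
# contribution to a summed score changes drastically.
print(round(norm_typical[1], 2), round(norm_outlier[1], 2))
print(rank_order(bm25_typical) == rank_order(bm25_outlier))
```

The document in second place keeps the same raw score and the same rank, yet its normalized value drops sharply once an outlier appears. A rank-based fusion would treat both cases identically.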

Reciprocal Rank Fusion (Cormack, Clarke & Büttcher, SIGIR 2009) sidesteps the calibration problem entirely by throwing away the scores and fusing on rank:

$$ \text{RRF}(d) = \sum_{r \in \text{retrievers}} \frac{1}{k + \text{rank}_r(d)} $$

A document's fused score is the sum, over each retriever, of one over (a constant $k$ plus that document's rank in that retriever's list).

RRF on real numbers

Let's name the parts: $d$ is the document we're scoring, $r$ ranges over your two retrievers (dense and BM25), $\text{rank}_r(d)$ is where that doc ranked in retriever $r$'s list (1st, 2nd, etc., assuming 1-indexed), and $k$ is the smoothing constant (usually 60).

Walk through a concrete example. Say you have a query and two documents:

  • Document A ranks #2 in dense (a close nearest neighbor) and #1 in BM25 (exact keyword match).
  • Document B ranks #1 in dense but only #8 in BM25 (a strong semantic match that shares few exact keywords).

With $k=60$:

  • Document A scores: $1/(60+2) + 1/(60+1) = 1/62 + 1/61 \approx 0.0161 + 0.0164 = 0.0325$
  • Document B scores: $1/(60+1) + 1/(60+8) = 1/61 + 1/68 \approx 0.0164 + 0.0147 = 0.0311$

Document A wins because it ranked well in both lists; B's first place in dense cannot make up for its weak BM25 rank. A hypothetical Document C ranked #1 in dense but #50 in BM25 would score $1/61 + 1/110 \approx 0.0255$, also losing to Document A's consensus. What just happened: RRF used only rank positions (no score recalibration) and rewarded the document that satisfied both retrievers, solving the scale-incomparability problem with a rank-based formula.

Properties worth internalizing:

  • It only needs ordinal information, so dense and sparse become directly comparable without any normalization. This robustness is why RRF ships as a built-in hybrid fusion method in Elasticsearch, OpenSearch, Weaviate, and Qdrant.
  • $k$ (the paper uses 60) is a smoothing constant. Small $k$ makes rank-1 dominate aggressively; large $k$ flattens the curve so deeper ranks still matter. With $k=60$, rank 1 scores $1/61 \approx 0.0164$ and rank 2 scores $1/62 \approx 0.0161$ — close, so a document ranked highly by both retrievers easily beats one ranked #1 by a single retriever. That consensus-rewarding behavior is the point.
  • A document appearing in both lists accumulates from both terms — agreement is rewarded structurally, not by tuning weights.

If you have a real reason to trust one retriever more (e.g., a code-search corpus where BM25 dominates), use weighted RRF: multiply each term by a per-retriever weight. But reach for that only with eval evidence — unweighted RRF is a shockingly strong baseline. RRF gets you high recall; it does not give you precision, because it never actually read the documents. That's the reranker's job.
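A minimal sketch of weighted RRF, assuming 1-indexed ranks and a per-retriever weight that defaults to 1.0 (which reduces to plain RRF). The function name `weighted_rrf`, the retriever labels, and the filler document ids are illustrative choices. The example reuses Documents A and B from the worked example: A is #2 in dense and #1 in BM25, B is #1 in dense and #8 in BM25.

```python
def weighted_rrf(rankings, weights=None, k=60):
    """Fuse ranked lists: score(d) = sum over retrievers of w_r / (k + rank_r(d)).

    rankings: dict mapping retriever name -> list of doc ids, best first.
    weights:  optional dict mapping retriever name -> weight (default 1.0).
    Returns a dict of doc id -> fused score, ordered best first.
    """
    weights = weights or {}
    fused = {}
    for name, ranking in rankings.items():
        w = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranking, start=1):   # 1-indexed ranks
            fused[doc_id] = fused.get(doc_id, 0.0) + w / (k + rank)
    return dict(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))

rankings = {
    "dense": ["B", "A"],
    "bm25": ["A", "x2", "x3", "x4", "x5", "x6", "x7", "B"],  # x* are filler docs
}

plain = weighted_rrf(rankings)
print(round(plain["A"], 4), round(plain["B"], 4))

# A deliberately extreme, hypothetical weight to show the knob's effect:
dense_heavy = weighted_rrf(rankings, weights={"dense": 10.0})
print(list(plain)[0], list(dense_heavy)[0])
```

With equal weights A wins on consensus, matching the hand calculation. Only a very large dense weight flips the order to B, which is a reminder that weights should come from eval evidence rather than intuition.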

3.4 Reranking: bi-encoders vs cross-encoders

Here is the distinction that the whole second stage hinges on.

  • A bi-encoder encodes query and document separately into vectors, then compares with a dot product. Because documents are encoded ahead of time and independently, you can index them and run ANN search over millions. Fast, scalable, lossy.
  • A cross-encoder takes the query and one document concatenated as a single input — [CLS] query [SEP] document [SEP] — and runs full self-attention across both, emitting a single relevance score. Every query token attends to every document token. This is dramatically more accurate at judging relevance because it models interaction, not just proximity of two summaries.

The catch: a cross-encoder produces no reusable document vector. You cannot precompute or index it — you must run a fresh forward pass for every (query, document) pair at query time. Scoring a million documents per query is a non-starter. Hence the canonical pattern:

Retrieve many (cheaply), rerank few (expensively). Use hybrid + RRF to fetch a high-recall candidate set of ~50–100, then run the cross-encoder over just those pairs to produce a precise final ordering, and keep the top ~3–10 for context.
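A back-of-envelope sketch of why the shortlist matters. The throughput figure is a made-up assumption chosen for round numbers, not a benchmark of any model; substitute a measured value from your own hardware.

```python
def rerank_seconds(num_pairs, pairs_per_second):
    """Time to score num_pairs (query, document) pairs at a given throughput."""
    return num_pairs / pairs_per_second

# Assumption for illustration only: a batched cross-encoder scoring 1,000 pairs/s.
ASSUMED_PAIRS_PER_SECOND = 1_000

full_corpus = rerank_seconds(1_000_000, ASSUMED_PAIRS_PER_SECOND)  # every doc, per query
shortlist = rerank_seconds(100, ASSUMED_PAIRS_PER_SECOND)          # top-100 candidates

print(f"whole corpus: {full_corpus:.0f} s per query")
print(f"top-100 shortlist: {shortlist * 1000:.0f} ms per query")
```

Whatever the real throughput is, the ratio is what matters: cost scales linearly with the number of pairs, so the candidate count is the knob that sets reranking latency.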

Reranking typically buys a meaningful relevance lift on top of the candidate stage — Anthropic's numbers above quantify one well-known instance (49% → 67% reduction in failed retrievals when reranking is added). The named players as of 2026: Cohere Rerank 3 (managed API, strong multilingual, the easy production default), BAAI bge-reranker-v2-m3 (open weights via FlagEmbedding, self-hostable, multilingual), mxbai-rerank-v2 (Apache-2.0, fast), and Jina Reranker v2. The choice is mostly API-vs-self-host and latency/cost, not a wild quality gulf.

3.5 ColBERT and late interaction: the middle path

There's a third option that sits between bi-encoder speed and cross-encoder quality. ColBERT (Khattab & Zaharia, SIGIR 2020) keeps a per-token embedding for every token in query and document, then scores via MaxSim: for each query token, find its maximum cosine similarity against any document token, and sum those maxima:

$$ \text{score}(Q, D) = \sum_{i \in Q} \max_{j \in D} \, (q_i \cdot d_j) $$

This is "late interaction": interaction happens at scoring time over precomputed token vectors, rather than at encoding time (cross-encoder) or being skipped entirely (bi-encoder). Because the document token vectors are precomputed and indexable, ColBERT can serve as a first-stage retriever, not just a reranker, and it recovers much of a cross-encoder's quality at a fraction of the latency. ColBERTv2 (2021) added residual compression to make index storage tractable, and the follow-up PLAID engine (2022) sped up search over that compressed index. The library to know is RAGatouille, which wraps ColBERT for practical RAG use.

The honest tradeoff: ColBERT's per-token vectors mean the index is much larger than a single-vector dense index (dozens of vectors per chunk instead of one) and the engineering is heavier. For most teams the pragmatic answer is still hybrid + a cross-encoder reranker; ColBERT earns its keep when reranking latency is the bottleneck and you can pay the storage, or when token-level matching genuinely helps your domain. Knowing when it's the right tool — not just that it exists — is the IC6 signal. More on advanced retrieval in /rag/advanced-retrieval.
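The MaxSim score from this section can be sketched in plain Python. This is a toy illustration of the scoring rule only, with hand-written 2-dimensional "token vectors" as an assumption; a real ColBERT setup uses learned per-token embeddings and batched tensor operations, which this sketch does not attempt to reproduce.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, doc_vecs):
    """For each query token take its best-matching doc token, then sum the maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy per-token vectors (assumed, not produced by any model).
query = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0])]
doc_a = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]  # covers both query tokens
doc_b = [normalize(v) for v in ([1.0, 0.0], [2.0, 0.0])]              # covers only the first

print(maxsim(query, doc_a), maxsim(query, doc_b))
```

Document A matches both query tokens and scores about 2.0, while Document B matches only one and scores about 1.0. The storage cost mentioned above is visible even here: each document keeps one vector per token rather than a single vector.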

4. Minimal implementation

A real, runnable hybrid pipeline: dense (sentence-transformers) + sparse (rank_bm25), fused with RRF, then reranked with an open-weights cross-encoder (bge-reranker-v2-m3 via FlagEmbedding). Swapping the reranker for the Cohere API takes only a few lines, shown below.

# pip install rank_bm25 sentence-transformers FlagEmbedding
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from FlagEmbedding import FlagReranker
import numpy as np
 
CORPUS = [
    "To cancel your subscription, open Billing and click End Plan.",
    "Error E-4012 means the payment processor declined the card.",
    "Subscription termination removes access at the end of the cycle.",
    "Our refund policy allows returns within 30 days of purchase.",
    "Reset E-4012 by re-authorizing the card under Payment Methods.",
]
 
# --- Index once (offline) ---
bm25 = BM25Okapi([doc.lower().split() for doc in CORPUS])         # sparse
encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")           # dense bi-encoder
doc_emb = encoder.encode(CORPUS, normalize_embeddings=True)       # precomputed vectors
 
def dense_search(query, k=10):
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_emb @ q                                          # cosine (normalized dot)
    return np.argsort(-scores)[:k].tolist()
 
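def sparse_search(query, k=10):
    scores = bm25.get_scores(query.lower().split())
    # Keep only docs that matched at least one query term; zero-score docs would
    # otherwise enter RRF in arbitrary order and dilute the dense ranking.
    return [int(i) for i in np.argsort(-scores)[:k] if scores[i] > 0]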
 
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score = sum 1/(k + rank), rank is 0-indexed here."""
    fused = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)
 
# --- Query time: retrieve many, then rerank few ---
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)  # cross-encoder
 
def hybrid_rerank(query, candidate_n=50, top_k=3):
    dense = dense_search(query, candidate_n)
    sparse = sparse_search(query, candidate_n)
    candidates = rrf_fuse([dense, sparse])[:candidate_n]          # high-recall stage
    pairs = [[query, CORPUS[i]] for i in candidates]
    scores = reranker.compute_score(pairs, normalize=True)        # joint query+doc scoring
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [(CORPUS[i], round(s, 3)) for i, s in reranked[:top_k]]  # high-precision stage
 
for hit in hybrid_rerank("how do I stop paying for my plan"):
    print(hit)          # paraphrase query → dense + reranker surface the cancel/terminate docs
print("---")
for hit in hybrid_rerank("E-4012"):
    print(hit)          # exact-token query → BM25 guarantees the E-4012 docs are candidates

What each stage is doing, and why it's honest production shape: bge-small-en-v1.5 is a real, fast embedding model (pick yours via the MTEB leaderboard — see /rag/chunking-and-embeddings); BM25Okapi is the standard rank_bm25 implementation; RRF fuses on rank so we never have to calibrate cosine against BM25 scores; and bge-reranker-v2-m3 is a true cross-encoder doing a forward pass per pair. The two queries demonstrate the disjoint-failure thesis: the paraphrase query relies on dense recall, the E-4012 query relies on sparse recall, and the same pipeline serves both.

To swap in the managed reranker, replace the FlagReranker block with the Cohere API (pip install cohere):

import cohere
co = cohere.ClientV2()  # reads the API key from the CO_API_KEY environment variable
def rerank(query, candidates, top_k=3):
    docs = [CORPUS[i] for i in candidates]
    r = co.rerank(model="rerank-v3.5", query=query, documents=docs, top_n=top_k)
    return [(docs[res.index], round(res.relevance_score, 3)) for res in r.results]

In production you would not hand-roll BM25 and dense as separate Python lists — you'd push both into one store (pgvector + Postgres full-text/ParadeDB, or Qdrant/Weaviate/OpenSearch with native hybrid + RRF) so a single query returns the fused candidate set, and call the reranker only on those. The generator that consumes the top-k is a separate concern; if you assemble those chunks into a Claude prompt, default to claude-opus-4-8 (or claude-haiku-4-5 for cheap, latency-sensitive answer synthesis) and cache the stable instruction prefix.

Code walkthrough: the hybrid_rerank function, section by section

Let's trace through the hybrid_rerank function step by step, as if explaining it at a whiteboard.

Setup & candidate retrieval (the dense, sparse, and candidates assignments): We call dense_search and sparse_search, each returning a list of top-k document indices. Then we pass both lists to rrf_fuse, which builds a dictionary where each doc_id accumulates a score from both retrievers: if doc 2 was rank 5 in dense and rank 3 in sparse (counting from 1), it gets 1/(60+5) + 1/(60+3). Finally we slice [:candidate_n] (default 50) to cap the fused list. This is our high-recall stage.

Pairing & cross-encoder scoring (the pairs and scores assignments): For each candidate, we build a [query, passage] pair, the exact format the cross-encoder expects. We call reranker.compute_score(pairs) to score the pairs in batches. Each pair gets its own forward pass in which the query and that one passage are read together, so attention can flow between them. This is the expensive part, but it stays fast because we only score ~50 pairs, not the whole corpus.

Reranking & return (the reranked assignment and the return): We zip the candidate indices with their new relevance scores, sort descending by score (no longer by RRF rank, since we trust the cross-encoder now), and slice [:top_k] (default 3) to return only the final top answers. The reranker may have completely reordered the candidates; it often promotes a low-RRF doc that's precisely relevant and buries a high-RRF doc that's topically related but not quite the answer.

What the whole function does: It retrieves a high-recall candidate set by fusing two weak but complementary signals, then runs a slow-but-accurate ranker over only those winners to surface the most precise final answers. This is the core pattern.

5. Production tradeoffs

| Stage | Latency (typical) | Cost driver | Quality role | Primary failure mode |
| --- | --- | --- | --- | --- |
| Dense ANN (HNSW) | 5–30 ms | RAM for index; embedding compute | Recall on paraphrase/semantics | Misses exact tokens, codes, rare terms |
| Sparse BM25 | 1–10 ms | Inverted index (cheap) | Recall on exact/rare terms | Blind to synonyms/paraphrase |
| RRF fusion | <1 ms | Negligible | Combines recall, no precision | Bad candidate from one retriever can still surface |
| Cross-encoder rerank (top-50) | 50–600 ms | GPU time or per-1k API calls | Precision (the big lift) | Latency spike; over-trusted scores; domain/lang mismatch |
| ColBERT (late interaction) | 10–60 ms | Large multi-vector index storage | Near-CE quality, retriever-capable | Index size; heavier ops |

Cost/latency in prose. The reranker is where your budget goes. A cross-encoder over 50 candidates is 50 forward passes — batch them on a GPU and it's tens of ms; do it sequentially over an API and p95 balloons. Three levers at scale (the IC6 answer): (1) shrink the candidate set — rerank top-25 instead of top-100 if eval shows recall is already saturated there; quality is roughly flat past the point where the relevant doc is reliably in the candidate set, so deeper reranking just burns latency. (2) Cache reranker scores keyed on (query, doc-id) — head queries repeat, and reranking is deterministic. (3) Tier the rerank: route only ambiguous or high-value queries through the expensive cross-encoder; serve the long tail from hybrid + RRF alone, or use ColBERT as a cheaper "good enough" reranker. Reserve a managed API (Cohere) when you want zero ops and multilingual coverage; self-host bge/mxbai when QPS is high enough that per-call pricing dominates and a GPU amortizes.
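Lever (2), caching on (query, doc-id), might look roughly like the sketch below. `CachedReranker` and `toy_overlap_score` are hypothetical names; the toy scorer merely stands in for a real cross-encoder call so the example runs without a model. A production cache would also need query normalization, eviction, and invalidation when a document or the reranker model changes, none of which is shown here.

```python
class CachedReranker:
    """Memoize reranker scores per (query, doc_id) pair."""

    def __init__(self, score_fn):
        # Hypothetical callable: (query, list_of_texts) -> list of floats.
        self.score_fn = score_fn
        self.cache = {}
        self.pairs_scored = 0          # pairs that actually reached the model

    def score(self, query, docs):
        """docs maps doc_id -> text; returns doc_id -> relevance score."""
        missing = [doc_id for doc_id in docs if (query, doc_id) not in self.cache]
        if missing:
            # One batched call covering only the uncached pairs.
            fresh = self.score_fn(query, [docs[doc_id] for doc_id in missing])
            self.pairs_scored += len(missing)
            for doc_id, s in zip(missing, fresh):
                self.cache[(query, doc_id)] = s
        return {doc_id: self.cache[(query, doc_id)] for doc_id in docs}


def toy_overlap_score(query, texts):
    """Stand-in for a cross-encoder: counts shared lowercase tokens."""
    q = set(query.lower().split())
    return [float(len(q & set(t.lower().split()))) for t in texts]


cached = CachedReranker(toy_overlap_score)
docs = {"d1": "reset the card", "d2": "refund policy"}
first = cached.score("reset card", docs)      # both pairs are scored
docs["d3"] = "card declined"
second = cached.score("reset card", docs)     # only the new d3 pair is scored
print(second, cached.pairs_scored)
```

On the repeated query only the newly added document reaches the scorer, which is the saving that makes head queries cheap.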

What changes at scale. At low QPS the API reranker is the right call — no GPU to babysit. As you cross into thousands of QPS, per-call cost and network round-trips push you to a self-hosted reranker with dynamic batching (vLLM-style or a dedicated inference server), and you start caring about GPU utilization and tail latency, not just mean. The candidate store also matters: a single hybrid-capable engine (Qdrant, Weaviate, OpenSearch, Turbopuffer) beats stitching two systems, because keeping a separate BM25 index and vector index in sync — same chunk boundaries, same doc IDs, same deletes — is a real source of silent recall bugs.

Failure modes to name in an interview: (1) Over-trusting the reranker — a confident relevance score is not groundedness; a doc can rank #1 and still not contain the answer, so you still evaluate faithfulness downstream (/rag/evaluation). (2) Domain/language mismatch — an English-tuned reranker silently degrades on code, legal text, or other languages; verify on your data. (3) Candidate-set ceiling — reranking can only reorder what the first stage retrieved; if recall@50 is bad, no reranker saves you, which is exactly why you measure retrieval recall before touching the reranker. (4) Chunking interactions — tiny chunks rerank well but starve the generator of context; this is where parent-child / small-to-big retrieval and Contextual Retrieval pay off (see /rag/chunking-and-embeddings).

6. How it's asked

[IC4] "Dense-only RAG returns generic troubleshooting pages for the query E-4012 but never the doc that literally contains E-4012. Diagnose and fix." The embedding model compresses text into a single vector and rare, high-information tokens like E-4012 lose their discriminative power — the model likely never saw the token, so it lands in a generic region. The fix is to add a sparse BM25 retriever, which matches E-4012 exactly via its inverted index, and fuse the two result lists with Reciprocal Rank Fusion so exact-match queries and paraphrase queries are both served. That alone typically recovers most of these failures; adding a reranker tightens precision further.
[IC5] "Walk me through RRF. Why fuse on ranks instead of normalizing and adding scores? What is k?" RRF scores each document as the sum over retrievers of $1/(k + \text{rank})$, using only each list's ordinal position. We fuse on rank because BM25 and cosine scores live on incomparable, query-dependent, non-stationary scales: any fixed min-max normalization is fragile to outliers and shifts per query, whereas ranks are always comparable. $k$ (the paper's value is 60) is a smoothing constant: small $k$ lets rank-1 dominate; large $k$ flattens the curve so a document ranked highly by both retrievers beats one ranked #1 by only one. That consensus-rewarding behavior, with no per-retriever weights to tune, is why RRF ships as a built-in hybrid fusion method in Elasticsearch, OpenSearch, Qdrant, and Weaviate.
[IC5] "Bi-encoder vs cross-encoder — why can't you just rerank everything with the cross-encoder?" A bi-encoder encodes query and document independently, so document vectors are precomputed and ANN-searchable over millions — fast but lossy. A cross-encoder concatenates query and document into one input and runs full self-attention across both, producing a far more accurate relevance score, but it yields no reusable, indexable document vector — you need a fresh forward pass per pair at query time. Scoring a whole corpus per query is computationally infeasible, so you retrieve a high-recall candidate set of ~50–100 cheaply with hybrid + RRF and run the cross-encoder only over those.
[IC6] "Cross-encoder reranking gives great quality but +600ms p95. Make it affordable at 2,000 QPS without gutting quality. When ColBERT instead?" Self-host the reranker with dynamic batching on GPU (per-call API latency and cost won't survive 2,000 QPS), then attack the candidate count — rerank top-25 instead of top-100 if recall is already saturated there, since quality is roughly flat once the relevant doc is reliably in the set. Cache scores on (query, doc-id) for repeated head queries, and tier the path: only ambiguous/high-value queries hit the cross-encoder, the long tail is served by hybrid + RRF or a cheaper reranker. ColBERT/late interaction is the move when reranking latency is the binding constraint and you can pay the storage: its MaxSim over precomputed per-token vectors recovers much of the cross-encoder's quality at a fraction of the latency, and it can even act as the first-stage retriever — the cost is a multi-vector index that's far larger than a single-vector dense index, plus heavier ops.
[IC6] "How do you know hybrid + rerank actually helped, and not just felt better?" Evaluate retrieval and generation separately on a golden set of real queries with labeled relevant chunks. For retrieval, track recall@k, nDCG, and MRR — first confirm the candidate stage gets recall@50 high (that's the ceiling the reranker works within), then confirm the reranker improves nDCG/MRR@5. For generation, measure faithfulness and answer relevancy (RAGAS-style LLM-as-judge). Gate these in CI so a model or chunking change can't silently regress retrieval. The discipline of measuring recall before precision is what stops you from tuning a reranker that's reordering a bad candidate set.
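The retrieval metrics named in the last answer can be sketched for a single query as follows. This assumes binary relevance labels (a chunk is either relevant or not) and uses made-up document ids; on a real golden set you would average each metric over all queries.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant doc (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0

# Hypothetical single query: same candidate set before and after reranking.
relevant = {"d1", "d4"}
before = ["d2", "d3", "d1", "d5", "d4"]   # fused candidate order
after = ["d1", "d4", "d2", "d3", "d5"]    # reranked order

print(recall_at_k(before, relevant, 5), recall_at_k(after, relevant, 5))
print(reciprocal_rank(before, relevant), reciprocal_rank(after, relevant))
print(round(ndcg_at_k(before, relevant, 5), 3), round(ndcg_at_k(after, relevant, 5), 3))
```

Recall@5 is identical before and after, because reranking only reorders the candidate set, while reciprocal rank and nDCG improve. That is the separation the answer describes: recall measures the candidate stage, and the rank-sensitive metrics measure the reranker.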

7. Pitfalls & flashcards

  • Min-max normalizing then adding dense and BM25 scores. Scales are incomparable and query-dependent; use RRF and fuse on rank.
  • Reranking the whole corpus. Cross-encoders aren't indexable — only ever rerank a retrieved candidate set.
  • Treating recall and precision as one problem. They're two stages with two metrics; fix recall (candidate stage) before precision (reranker).
  • Skipping BM25 because "embeddings are state of the art." Embeddings lose exact tokens, codes, and rare terms by construction. Hybrid isn't a legacy hack; it's the current default.
  • Confusing a high reranker score with groundedness. Relevance ≠ the answer is present; keep a faithfulness eval downstream (/rag/evaluation).
  • Letting the two indexes drift. Separate BM25 and vector indexes must share chunk boundaries, IDs, and deletes — prefer one hybrid-capable store.
  • Using an English reranker on code/legal/multilingual data without testing. Verify on your domain; reach for multilingual models (Cohere Rerank, bge-reranker-v2-m3) when needed.
  • Over-deep reranking. Past the point where the relevant doc is reliably in the candidate set, deeper reranking only adds latency. Find that point on your eval set.

Flashcard. Dense recalls paraphrase, sparse recalls exact tokens, and their errors are disjoint — so fuse them with RRF ($\sum 1/(k+\text{rank})$, $k\approx60$) for a high-recall candidate set, then rerank with a cross-encoder over the top ~50 to get precision in the top ~5. Retrieve many cheaply, rerank few expensively. ColBERT's MaxSim over per-token vectors is the latency-friendly middle path when the cross-encoder is too slow.

8. Further reading

Next: /rag/advanced-retrieval — query transforms (HyDE, multi-query, step-back), GraphRAG, and agentic retrieval, where the model decides whether and what to retrieve (/agents/tool-use-and-mcp).

Primary sources