Hybrid Search
Sixty years of search distilled into one architecture. Neither keyword nor semantic search is enough alone — hybrid search combines both, and the math is simpler than you think.
60 Years of Finding the Right Document
Hybrid search is not a single invention. It is the convergence of two independent research lineages — keyword matching and semantic understanding — that evolved in parallel for decades before someone thought to combine them. Understanding why each approach was built, what it solved, and where it failed is essential to understanding why hybrid search works.
This is not background reading. It is the fastest way to build intuition for when to favor keyword weight vs. semantic weight in your own system.
Boolean Search: AND, OR, NOT
The earliest computerized information retrieval systems — MEDLARS at the National Library of Medicine (1964), DIALOG at Lockheed (1966) — used Boolean logic. A query was a logical expression: authentication AND failure AND NOT timeout. A document either matched or it did not. There was no concept of ranking, relevance scores, or "better" matches.
# Boolean retrieval — binary match, no ranking
query = "authentication AND failure AND OAuth2"
results = []
for doc in corpus:
    has_auth = "authentication" in doc.lower()
    has_fail = "failure" in doc.lower()
    has_oauth = "oauth2" in doc.lower()
    if has_auth and has_fail and has_oauth:
        results.append(doc)  # Either in or out — no score

Boolean search dominated for two decades. Trained librarians became expert query builders, constructing elaborate nested expressions. But the model had a fatal flaw: it returned either too many results (with broad queries) or zero results (with specific ones). There was no middle ground, no way to say "this document is a better match than that one."
tf-idf: The Birth of Ranking
In 1972, Karen Sparck Jones at Cambridge published "A Statistical Interpretation of Term Specificity and Its Application in Retrieval," introducing inverse document frequency (IDF). Her insight: a term that appears in many documents ("the", "is") carries less information than a term appearing in few ("OAuth2", "mitochondria"). Combined with Gerard Salton's term frequency (TF) work at Cornell, tf-idf became the first practical relevance scoring system.
"The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs."
tf-idf turned search from a binary gate into a ranked list. It was the standard for 30 years — and the conceptual ancestor of BM25.
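The idea fits in a few lines. Here is a minimal sketch with a toy three-document corpus (log base and smoothing conventions vary between real implementations):

```python
import math

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "OAuth2 token refresh flow",
]

def tf_idf(term: str, doc: str, corpus: list[str]) -> float:
    tf = doc.split().count(term)                     # term frequency in this doc
    df = sum(term in d.split() for d in corpus)      # documents containing term
    idf = math.log(len(corpus) / df) if df else 0.0  # rare term -> high weight
    return tf * idf

# "the" appears twice here but is common; "OAuth2" appears once but is rare
print(tf_idf("the", corpus[0], corpus))     # 2 * log(3/2), about 0.81
print(tf_idf("OAuth2", corpus[2], corpus))  # 1 * log(3/1), about 1.10
```

Despite appearing twice, "the" scores below the single occurrence of "OAuth2": IDF, not raw frequency, does the work.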
BM25: The Algorithm That Still Powers Search
Stephen Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford at City University London published the Okapi BM25 ranking function in 1994 as part of the TREC-3 experiments. BM25 fixed two problems with tf-idf:
- Term frequency saturation — in tf-idf, repeating a word 10x gives 10x the score. BM25 applies a logarithmic saturation curve so keyword-stuffed documents do not dominate.
- Document length normalization — longer documents naturally contain more term occurrences. BM25 normalizes by document length, controlled by parameter b.
# BM25 scoring for a single term t in document d
#
# IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5) + 1)
# score(t,d) = IDF(t) * (tf(t,d) * (k1 + 1)) / (tf(t,d) + k1 * (1 - b + b * |d|/avgdl))
#
# k1 = 1.2  (term frequency saturation)
# b  = 0.75 (length normalization)
# N  = total documents, df(t) = docs containing t
# |d| = doc length, avgdl = average doc length
— Robertson, S. et al. (1994). Okapi at TREC-3. TREC. See also: Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in IR.
Thirty years later, BM25 remains the default first-stage retriever in Elasticsearch, Solr, Lucene, and virtually every production search system on the planet. It is simple, interpretable, requires no training data, and is remarkably hard to beat on keyword-heavy queries. But it has one fundamental limitation: it only matches surface forms. A query for "car repair" will never find a document about "automobile maintenance."
DPR: Dense Passage Retrieval
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih at Facebook AI Research published the paper that made neural retrieval practical at scale. DPR used two separate BERT encoders — one for queries, one for passages — trained with contrastive learning on question-answer pairs from Natural Questions.
"We show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework."
— Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP. 5,000+ citations.
DPR outperformed BM25 on open-domain QA benchmarks by a wide margin — 65.2% top-20 accuracy vs 59.1% on Natural Questions. For the first time, a neural retriever beat the classical baseline convincingly on a major benchmark.
But DPR also revealed the complementary nature of the two approaches. On entity-heavy queries ("what year did the Berlin Wall fall"), BM25 still won because the exact terms "Berlin Wall" are powerful signals. On paraphrase-heavy queries ("when did the barrier dividing East and West Germany come down"), DPR dominated. Neither was universally better.
ColBERT, ANCE & the Retrieval Arms Race
Omar Khattab and Matei Zaharia introduced ColBERT (2020), which kept per-token embeddings instead of compressing to a single vector — enabling fine-grained token-level matching while remaining fast via late interaction. Xiong et al. introduced ANCE (2021), using the model's own predictions to mine hard negatives during training. Each advance showed that dense retrieval could be improved, but also that BM25 remained stubbornly competitive on certain query types.
— Khattab, O. & Zaharia, M. (2020). ColBERT. SIGIR.
— Xiong, L. et al. (2021). ANCE. ICLR.
The Hybrid Consensus
By 2021, the research community reached a clear consensus: combining sparse and dense retrieval consistently outperforms either alone. The BEIR benchmark (Thakur et al., 2021) tested retrievers across 18 diverse datasets and showed that no single method dominated everywhere. BM25 was best on some datasets (BioASQ, TREC-COVID), dense retrievers on others (NQ, HotpotQA), and hybrid combinations beat both on average.
Vector databases (Weaviate, Qdrant, Milvus, Pinecone) raced to add built-in hybrid search. Elasticsearch added kNN search. The Reciprocal Rank Fusion (RRF) algorithm, originally proposed by Cormack, Clarke, and Büttcher in 2009 for metasearch, became the standard fusion method — simple, effective, and requiring no training.
— Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS.
— Cormack, G., Clarke, C., & Büttcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR.
The throughline: 1960 → 2026
Four eras. Two parallel tracks that finally merged.
Each generation solved a limitation of the last. Hybrid search is not a hack — it is the natural conclusion of sixty years of retrieval research.
The Problem: Neither Search Is Perfect
In production systems, you quickly discover that pure semantic search and pure keyword search each have critical blind spots. These are not edge cases — they affect 20-40% of real queries depending on your domain.
Semantic Search Failures
- Misses exact matches: "error code 0x8007045D"
- Struggles with proper nouns, product IDs, version numbers
- May return "similar" results instead of what user typed
- Embeddings conflate antonyms ("hot" and "cold" are similar in embedding space)
- Fails on rare domain terms not well-represented in training data
Keyword Search Failures
- Misses synonyms: "car repair" vs "automobile maintenance"
- Fails on paraphrases and rewordings entirely
- No understanding of context, intent, or question structure
- Vocabulary mismatch between query and documents (the "lexical gap")
- Overweights rare terms that appear incidentally in irrelevant documents
Concrete Example: Same Query, Different Strengths
Query: "how to fix authentication failure in OAuth2"
BM25 returns (rank 1):
"OAuth2 authentication failure troubleshooting guide"
Exact keyword overlap on "OAuth2", "authentication", "failure"
Vector search returns (rank 1):
"Debugging login issues with identity providers"
Semantic match — "login issues" means "authentication failure"
Hybrid search returns both — the exact match AND the semantic match, giving users the best of both worlds. The user who types "OAuth2" explicitly gets the keyword-matched guide. The user who describes the problem in their own words gets the semantic match.
The Architecture: Two Paths, One Result
Hybrid search runs both retrievers in parallel and combines their ranked results using a fusion algorithm. The query goes through two independent paths, and a score-agnostic fusion step merges the results.
Architecture Overview
Input
Query
Path 1
BM25
Sparse / Keyword
Path 2
Vector
Dense / Semantic
Fusion
RRF
Rank-based
Output
Merged Results
Key insight: BM25 and vector search run independently. Neither knows about the other. The fusion step only sees ranks, not raw scores.
BM25: The Keyword Engine
BM25 (Best Matching 25) is the industry-standard keyword ranking algorithm and the default scorer in Lucene, Elasticsearch, and Solr. It scores documents based on term frequency and inverse document frequency, with saturation to prevent keyword stuffing.
from rank_bm25 import BM25Okapi
import numpy as np
documents = [
"OAuth2 authentication failure troubleshooting guide",
"How to configure SSO with SAML providers",
"Debugging login issues with identity providers",
"REST API authentication best practices",
"Token refresh flow implementation guide",
"Kubernetes pod authentication with service accounts",
"CORS preflight request failures in browser",
]
# Tokenize (production: use proper tokenizer with stemming)
tokenized_docs = [doc.lower().split() for doc in documents]
# Build BM25 index — no training data needed
bm25 = BM25Okapi(tokenized_docs, k1=1.5, b=0.75)
# Search
query = "authentication failure OAuth2"
tokenized_query = query.lower().split()
scores = bm25.get_scores(tokenized_query)
# Ranked results
ranked = np.argsort(scores)[::-1]
print("BM25 Results:")
for idx in ranked[:3]:
    print(f" score={scores[idx]:.3f}: {documents[idx]}")

Output:
BM25 Results:
 score=3.142: OAuth2 authentication failure troubleshooting guide
 score=1.307: REST API authentication best practices
 score=1.052: Kubernetes pod authentication with service accounts
Exact term overlap drives the ranking. "Debugging login issues" scores low because it shares no keywords with the query, despite being semantically relevant.
BM25 Parameters — What They Actually Control
k1 (default: 1.2-1.5)
Controls term frequency saturation. At k1=0, TF has no effect (binary match). As k1 increases, repeated terms get more weight. Lucene and Elasticsearch default to 1.2; the rank_bm25 library defaults to 1.5.
b (default: 0.75)
Controls length normalization. At b=0, document length is ignored. At b=1, full normalization. If your documents are roughly equal length, b=0.3 works well. For mixed lengths, keep 0.75.
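Both knobs are easy to probe by coding the single-term BM25 formula directly. The following is a self-contained sketch; the term statistics are made up purely for illustration:

```python
import math

def bm25_term(tf: float, df: int, doc_len: int, avgdl: float, N: int,
              k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 contribution of one query term to one document's score."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avgdl))

# k1 controls saturation: 5x the occurrences gives far less than 5x the score
s1 = bm25_term(tf=1, df=20, doc_len=100, avgdl=100, N=5_000)
s5 = bm25_term(tf=5, df=20, doc_len=100, avgdl=100, N=5_000)
print(s5 < 5 * s1)  # True

# At k1=0, term frequency is a binary signal: tf=1 and tf=10 tie
print(bm25_term(1, 20, 100, 100, 5_000, k1=0.0)
      == bm25_term(10, 20, 100, 100, 5_000, k1=0.0))  # True

# b controls length normalization: at b=1 a 10x-average-length doc is penalized
print(bm25_term(3, 20, 1000, 100, 5_000, b=1.0)
      < bm25_term(3, 20, 1000, 100, 5_000, b=0.0))  # True
```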
Vector Search: The Semantic Engine
Vector search encodes queries and documents as dense numerical vectors using a transformer model, then finds similar documents via nearest-neighbor search. We covered embeddings in detail in Lesson 0.1 — here is a quick refresher in the context of hybrid retrieval.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
documents = [
"OAuth2 authentication failure troubleshooting guide",
"How to configure SSO with SAML providers",
"Debugging login issues with identity providers",
"REST API authentication best practices",
"Token refresh flow implementation guide",
"Kubernetes pod authentication with service accounts",
"CORS preflight request failures in browser",
]
# Encode all documents once (store these vectors)
doc_embeddings = model.encode(documents, normalize_embeddings=True)
# Encode query at search time
query = "fix login problems"
query_embedding = model.encode(query, normalize_embeddings=True)
# Cosine similarity = dot product (since normalized)
scores = np.dot(doc_embeddings, query_embedding)
ranked = np.argsort(scores)[::-1]
print("Vector Search Results:")
for idx in ranked[:3]:
    print(f" score={scores[idx]:.3f}: {documents[idx]}")

Output:
Vector Search Results:
 score=0.721: Debugging login issues with identity providers
 score=0.654: OAuth2 authentication failure troubleshooting guide
 score=0.589: REST API authentication best practices
"fix login problems" matches "Debugging login issues" despite sharing only one keyword ("login"). The embedding model understands that "fix" is similar to "debugging" and "problems" is similar to "issues."
Reciprocal Rank Fusion (RRF): The Math
RRF was proposed by Cormack, Clarke, and Büttcher at the University of Waterloo in 2009 for combining results from multiple search engines. Its genius is simplicity: it only uses rank positions, not raw scores. This means you can combine BM25 scores (which are unbounded floats) with cosine similarities (which are -1 to 1) without any normalization.
The RRF Formula
RRF(d) = 1 / (k + rank_BM25(d)) + 1 / (k + rank_vector(d))
k is a constant, typically 60. It prevents the top-ranked document from dominating the fusion score (without k, rank 1 would contribute 1.0 while rank 2 contributes 0.5 — a 2x gap for a single position change). With k=60, rank 1 contributes 1/61 = 0.0164 and rank 2 contributes 1/62 = 0.0161 — a much smoother gradient.
rank is 1-indexed (the top result has rank 1). Documents not returned by a retriever are treated as having rank infinity (RRF contribution = 0).
Worked Example: RRF Step by Step
Query: "how to fix authentication failure in OAuth2". Both retrievers return their top 5.
BM25 RANKED LIST
1. Doc A: "OAuth2 auth failure guide"
2. Doc D: "REST API auth best practices"
3. Doc F: "K8s pod authentication"
4. Doc E: "Token refresh flow"
5. Doc B: "SSO with SAML"
VECTOR RANKED LIST
1. Doc C: "Debugging login issues"
2. Doc A: "OAuth2 auth failure guide"
3. Doc D: "REST API auth best practices"
4. Doc F: "K8s pod authentication"
5. Doc G: "CORS preflight failures"
RRF CALCULATION (k=60)
| Doc | BM25 rank | Vector rank | 1/(60+rank_BM25) | 1/(60+rank_vector) | RRF score |
|---|---|---|---|---|---|
| A | 1 | 2 | 0.01639 | 0.01613 | 0.03252 |
| D | 2 | 3 | 0.01613 | 0.01587 | 0.03200 |
| F | 3 | 4 | 0.01587 | 0.01563 | 0.03150 |
| C | — | 1 | 0 | 0.01639 | 0.01639 |
| E | 4 | — | 0.01563 | 0 | 0.01563 |
| B | 5 | — | 0.01538 | 0 | 0.01538 |
| G | — | 5 | 0 | 0.01538 | 0.01538 |
FINAL HYBRID RANKING
1. Doc A (0.03252) — appeared in BOTH lists, ranked high in both
2. Doc D (0.03200) — appeared in BOTH lists
3. Doc F (0.03150) — appeared in BOTH lists
4. Doc C (0.01639) — vector only, but ranked #1 there
5. Doc E (0.01563) — BM25 only, rank 4
Key observation: Doc A wins because it ranked well in both lists. Doc C was the top vector result but only appeared in one list, so it ranks lower than documents that appeared in both. This is the core power of RRF: it rewards consensus across retrievers.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np
def reciprocal_rank_fusion(
    ranked_lists: list[list[int]],
    k: int = 60
) -> list[tuple[int, float]]:
    """
    Combine multiple ranked lists using RRF.

    Args:
        ranked_lists: List of ranked document ID lists.
            Each inner list is ordered by relevance (best first).
        k: RRF constant (default 60, from original paper).

    Returns:
        List of (doc_id, rrf_score) tuples, sorted by score descending.
    """
    rrf_scores: dict[int, float] = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list, start=1):
            if doc_id not in rrf_scores:
                rrf_scores[doc_id] = 0.0
            rrf_scores[doc_id] += 1.0 / (k + rank)
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

def hybrid_search(
    query: str,
    documents: list[str],
    bm25: BM25Okapi,
    model: SentenceTransformer,
    doc_embeddings: np.ndarray,
    top_k: int = 20,
    k: int = 60
) -> list[tuple[int, float, float, float]]:
    """
    Full hybrid search pipeline.

    Returns: [(doc_idx, rrf_score, bm25_score, vector_score), ...]
    """
    # Path 1: BM25
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_ranking = np.argsort(-bm25_scores)[:top_k].tolist()

    # Path 2: Vector
    query_emb = model.encode(query, normalize_embeddings=True)
    vector_scores = np.dot(doc_embeddings, query_emb)
    vector_ranking = np.argsort(-vector_scores)[:top_k].tolist()

    # Fusion
    fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking], k=k)

    # Attach original scores for debugging
    results = []
    for doc_idx, rrf_score in fused:
        results.append((
            doc_idx, rrf_score,
            float(bm25_scores[doc_idx]),
            float(vector_scores[doc_idx])
        ))
    return results
# --- Usage ---
documents = [
"OAuth2 authentication failure troubleshooting guide",
"How to configure SSO with SAML providers",
"Debugging login issues with identity providers",
"REST API authentication best practices",
"Token refresh flow implementation guide",
]
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query = "how to fix authentication failure in OAuth2"
results = hybrid_search(query, documents, bm25, model, doc_embeddings)
print("Hybrid Search Results:")
for idx, rrf, bm25_s, vec_s in results[:5]:
    print(f" RRF={rrf:.5f} | BM25={bm25_s:.3f} | Vec={vec_s:.3f}")
    print(f"   {documents[idx]}")

Weighted Hybrid: Tuning the Balance
Standard RRF weights both sources equally. In practice, you often want to bias toward one retriever based on your domain. There are two common approaches: alpha blending (weighted RRF) and score interpolation (normalized score combination).
def weighted_rrf(
    bm25_ranking: list[int],
    vector_ranking: list[int],
    alpha: float = 0.5,
    k: int = 60
) -> list[tuple[int, float]]:
    """
    Weighted RRF: alpha controls the balance.

    alpha = 0.5 -> equal weight (standard RRF)
    alpha = 0.7 -> 70% keyword, 30% semantic
    alpha = 0.3 -> 30% keyword, 70% semantic
    """
    scores: dict[int, float] = {}
    for rank, doc_id in enumerate(bm25_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank)
    for rank, doc_id in enumerate(vector_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

Favor Keyword (alpha = 0.7)
Use when queries contain:
- Error codes and stack traces
- Product SKUs or model numbers
- Legal document identifiers
- API endpoints and function names
- Medical codes (ICD-10, CPT)
Equal Weight (alpha = 0.5)
Best as a starting point when:
- Query types are mixed/unknown
- General-purpose search
- You have no evaluation data yet
- Internal knowledge bases
- E-commerce product search
Favor Semantic (alpha = 0.3)
Use when users describe problems:
- Natural language questions
- FAQ and support content
- Conceptual/research queries
- Multi-lingual search
- Users with non-expert vocabulary
Practical Guidance on Tuning Alpha
Start at alpha=0.5. Collect a set of 50-100 real user queries with relevance judgments (even just binary relevant/not relevant). Run a sweep of alpha from 0.1 to 0.9 in steps of 0.1 and measure NDCG@10 or MRR. The optimal alpha is rarely 0.5 — it tends to cluster around 0.3-0.4 for natural-language-heavy domains and 0.6-0.7 for identifier-heavy domains. Some production systems use a query classifier to dynamically set alpha per query.
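The sweep itself is a few lines of code. Here is a sketch assuming a hypothetical `search(query, alpha)` function that returns ranked doc IDs, plus binary relevance judgments; swap MRR for NDCG@10 if you have graded labels:

```python
def mrr(ranked_ids: list[int], relevant: set[int]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def sweep_alpha(queries, judgments, search):
    """Return the alpha in {0.1 ... 0.9} with the best mean MRR."""
    alphas = [round(0.1 * i, 1) for i in range(1, 10)]
    def mean_mrr(a):
        return sum(mrr(search(q, a), judgments[q]) for q in queries) / len(queries)
    return max(alphas, key=mean_mrr)
```

With 50-100 judged queries this runs in seconds. Log the full curve, not just the argmax, so you can see how flat the optimum is before committing to a value.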
Benchmarks: What the Numbers Actually Show
The BEIR benchmark (Thakur et al., 2021) is the standard for evaluating retrieval systems across diverse domains. It tests zero-shot transfer: models are not fine-tuned on the target dataset. The table below shows NDCG@10 results for BM25, a representative dense retriever (contriever-msmarco), and hybrid fusion.
BEIR Benchmark Results (NDCG@10, Zero-Shot)
| Dataset | Domain | BM25 | Dense | Hybrid | Gain |
|---|---|---|---|---|---|
| MS MARCO | Web search | 0.228 | 0.407 | 0.431 | +5.9% |
| Natural Questions | QA | 0.329 | 0.498 | 0.536 | +7.6% |
| TREC-COVID | Biomedical | 0.656 | 0.596 | 0.712 | +8.5% |
| FiQA | Finance | 0.236 | 0.329 | 0.368 | +11.9% |
| SciFact | Scientific | 0.665 | 0.677 | 0.721 | +6.5% |
| NFCorpus | Nutrition | 0.325 | 0.328 | 0.358 | +9.1% |
| DBPedia | Entity | 0.313 | 0.292 | 0.341 | +8.9% |
| HotpotQA | Multi-hop QA | 0.603 | 0.638 | 0.672 | +5.3% |
Note that BM25 outperforms the dense retriever outright on TREC-COVID and DBPedia. Sources: BEIR benchmark (Thakur et al., 2021); hybrid numbers from RRF fusion experiments reported in Ma et al. (2022) and Chen et al. (2024).
When BM25 Wins Alone
TREC-COVID and DBPedia are domains where queries contain specific technical terms ("SARS-CoV-2 spike protein", "entity:Berlin_Wall"). Exact matching is a powerful signal here. Dense models trained on web text struggle with out-of-domain terminology. But even in these domains, hybrid still beats BM25 alone by 8-9%.
The Consistency Argument
Hybrid search never loses to either individual method by more than 1-2% on any dataset, but regularly wins by 5-12%. In production, where you face diverse query types from unpredictable users, this worst-case guarantee matters more than peak performance on any single benchmark.
Production Implementation: Three Databases
In production, you use a vector database with built-in hybrid search rather than implementing RRF yourself. Here are complete, copy-pastable examples for the three most popular options.
Weaviate
Built-in hybrid, alpha parameter

import weaviate
from weaviate.classes.query import HybridFusion

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()
collection = client.collections.get("Document")

# Hybrid search with alpha blending
# alpha=0: pure BM25, alpha=1: pure vector, alpha=0.5: equal
response = collection.query.hybrid(
    query="authentication failure OAuth2",
    alpha=0.5,
    fusion_type=HybridFusion.RELATIVE_SCORE,  # or RANKED (RRF)
    limit=10,
    return_metadata=["score", "explain_score"],
)

for obj in response.objects:
    print(f"{obj.metadata.score:.4f}: {obj.properties['title']}")

client.close()

Weaviate v4+ supports both RRF (RANKED) and relative score fusion. The alpha parameter only applies to relative score fusion; RRF uses equal weight by default.
Qdrant
Prefetch + fusion architecture

from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

client = QdrantClient("localhost", port=6333)
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

query_text = "authentication failure OAuth2"
query_vector = model.encode(query_text).tolist()

# Qdrant uses prefetch to run both retrievals, then fuse
results = client.query_points(
    collection_name="documents",
    prefetch=[
        # Dense retrieval path
        models.Prefetch(
            query=query_vector,
            using="dense",
            limit=20,
        ),
        # Sparse retrieval path (BM25-like via SPLADE or bag-of-words)
        models.Prefetch(
            query=models.SparseVector(
                indices=[1, 42, 1337],  # token IDs
                values=[0.8, 0.6, 0.9],  # weights
            ),
            using="sparse",
            limit=20,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=10,
)

for point in results.points:
    print(f"{point.score:.5f}: {point.payload['title']}")

Qdrant requires you to provide sparse vectors explicitly (e.g., from SPLADE or a BM25 tokenizer). It does not have a built-in BM25 index — the sparse vectors are your keyword signal.
Elasticsearch
RRF via the rrf retriever (8.14+)

# Elasticsearch 8.14+ — native RRF retriever
# This replaces the older bool query approach
PUT /documents
{
  "mappings": {
    "properties": {
      "content": { "type": "text" },
      "embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

# Hybrid search with native RRF
GET /documents/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "match": {
                "content": "authentication failure OAuth2"
              }
            }
          }
        },
        {
          "knn": {
            "field": "embedding",
            "query_vector": [0.12, -0.03, 0.47],
            "k": 10,
            "num_candidates": 100
          }
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 60
    }
  },
  "size": 10
}

Elasticsearch 8.14 introduced the rrf retriever, replacing the workaround of using bool.should with sub_searches. The rank_constant parameter is the k value in the RRF formula.
Common Pitfalls
Pitfall 1: Normalizing Scores Before Fusion
A frequent mistake is to normalize BM25 and vector scores to [0,1] and then average them. This sounds reasonable but is fragile: BM25 score distributions vary wildly depending on query length, vocabulary, and corpus size. Min-max normalization over a single query's results is dominated by outliers. RRF avoids this entirely by using only rank positions.
# DON'T DO THIS — fragile score normalization
bm25_norm = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())
vec_norm = (vec_scores - vec_scores.min()) / (vec_scores.max() - vec_scores.min())
final = 0.5 * bm25_norm + 0.5 * vec_norm  # Unstable, distribution-dependent

# DO THIS — rank-based fusion (RRF)
fused = reciprocal_rank_fusion([bm25_ranked, vector_ranked], k=60)  # Stable
Pitfall 2: Insufficient Prefetch Depth
If you fetch top-10 from each retriever and then fuse, a document ranked #11 in both lists (which RRF would rank highly) is invisible. Fetch 2-5x your final result count from each retriever. For 10 final results, prefetch 30-50 from each side. The RRF computation itself is O(n) and negligible compared to the retrieval cost.
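A quick sketch makes the failure concrete (synthetic doc IDs; `rrf` here is a local helper, not a library call):

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists by reciprocal rank; best fused doc first."""
    scores: dict[str, float] = {}
    for lst in ranked_lists:
        for rank, doc in enumerate(lst, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Doc "X" sits at rank 11 in BOTH retrievers' full rankings.
bm25_full = [f"b{i}" for i in range(10)] + ["X"] + [f"b{i}" for i in range(10, 20)]
vec_full = [f"v{i}" for i in range(10)] + ["X"] + [f"v{i}" for i in range(10, 20)]

shallow = rrf([bm25_full[:10], vec_full[:10]])  # prefetch depth 10: X invisible
deep = rrf([bm25_full[:30], vec_full[:30]])     # prefetch depth 30
print("X" in shallow)  # False
print(deep[0])         # X: consensus at rank 11 beats every single-list doc
```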
Pitfall 3: Ignoring the Tokenizer Mismatch
BM25 and your embedding model use different tokenizers. BM25 typically uses whitespace + stemming (Porter or Snowball). Embedding models use BPE or WordPiece. A term like "OAuth2" might be a single BM25 token but get split into ["O", "Auth", "2"] by the embedding model. This means the two retrievers see the same query differently — which is actually a feature, not a bug. It increases the diversity of retrieved results.
Beyond RRF: Other Fusion Methods
RRF is the dominant method, but it is not the only option. The research community continues to explore alternatives, each with different trade-offs.
Learned Fusion
Train a small model (linear, gradient-boosted tree) to combine raw scores from multiple retrievers. Requires labeled relevance data. Used at Google, Bing, and Amazon internally. Can outperform RRF by 3-5% but requires ongoing training data collection.
— Ma, X. et al. (2022). A Replication Study of Dense Passage Retriever. arXiv.
Convex Combination (CC)
Normalize scores to a common scale and linearly interpolate: score = alpha * norm_bm25 + (1 - alpha) * norm_vec. Simpler than RRF but sensitive to normalization. Weaviate's RELATIVE_SCORE mode implements this.
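For illustration, a minimal convex-combination sketch using per-query min-max normalization (hypothetical score arrays; Pitfall 1 above explains why this normalization can be fragile, and production systems often use more robust scaling):

```python
import numpy as np

def convex_combination(bm25_scores, vec_scores, alpha=0.5):
    """Min-max normalize each score list to [0, 1], then interpolate."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * minmax(bm25_scores) + (1 - alpha) * minmax(vec_scores)

# Doc 0 tops both scales, so it tops the blend regardless of alpha
fused = convex_combination([3.1, 1.3, 0.0], [0.72, 0.65, 0.59], alpha=0.5)
print(fused.argmax())  # 0
```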
BGE-M3: Unified Sparse-Dense
BAAI's BGE-M3 model generates both dense and sparse (learned SPLADE-like) vectors from a single model in a single forward pass. This eliminates the need for a separate BM25 index — the model itself learns what to match lexically vs. semantically.
Hybrid + Reranker (Two-Stage)
The production-grade pattern: use hybrid search as a first stage to cheaply retrieve 50-100 candidates, then apply a cross-encoder reranker (e.g., Cohere Rerank, BGE-Reranker) to re-score the top results. The reranker sees query and document together, enabling deeper semantic matching.
Covered in detail in Lesson 3.2: Reranking.
Key Takeaways
1. Neither search type is sufficient alone — BM25 misses synonyms, vector search misses exact terms. Real user queries exercise both failure modes.
2. RRF is rank-based, not score-based — it uses only positions, so incompatible score scales (BM25 floats vs. cosine similarity) are not a problem. k=60 is the standard. No training data required.
3. Hybrid search consistently beats both methods — 5-12% gain on BEIR benchmarks, and it never catastrophically fails on any query type. The worst-case guarantee matters more than peak performance.
4. Tune alpha on your data, not intuition — start at 0.5, sweep 0.1-0.9 with real queries. Then add a reranker on top for another 5-10% gain.
References & Further Reading
- Sparck Jones, K. (1972). A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation, 28(1), 11-21.
- Robertson, S. et al. (1994). Okapi at TREC-3. TREC.
- Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
- Cormack, G., Clarke, C., & Büttcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR.
- Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP.
- Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR.
- Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS Datasets and Benchmarks.
- Ma, X. et al. (2022). A Replication Study of Dense Passage Retriever. arXiv.
- Chen, J. et al. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv.