If you have built a search system in the last two years, you have probably used dense embeddings -- high-dimensional vectors that capture semantic meaning. Dense retrieval is powerful, but it has well-documented failure modes: it struggles with exact keyword matching, rare terms, and queries where lexical overlap matters more than semantic similarity.
The solution is not to abandon dense retrieval but to combine it with complementary methods. This post explains how Lakehouse42 implements hybrid search by fusing three retrieval signals -- dense embeddings, sparse vectors, and BM25 -- using Reciprocal Rank Fusion (RRF).
The Three Retrieval Signals
Dense Embeddings
Dense embeddings map text into a continuous vector space (typically 768 or 1024 dimensions) where semantically similar texts are close together. We use BGE-M3, a multi-lingual, multi-granularity model from BAAI that produces high-quality embeddings across 100+ languages.
Strengths:

- Semantic matching: paraphrases, synonyms, and related concepts match even with zero word overlap.
- Multilingual: BGE-M3 covers 100+ languages with a single model.

Weaknesses:

- Struggles with exact keyword matching, rare terms, and queries where lexical overlap matters most.
- Opaque: individual dimensions carry no human-readable meaning.
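To make "close together" concrete, here is a toy cosine-similarity check. The 4-dim vectors stand in for the real 1024-dim BGE-M3 embeddings; the values are made up for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings.
query = [0.1, 0.9, 0.2, 0.0]
doc_related = [0.2, 0.8, 0.1, 0.1]
doc_unrelated = [0.9, 0.0, 0.1, 0.8]

# The semantically related document scores higher than the unrelated one.
print(cosine_similarity(query, doc_related) > cosine_similarity(query, doc_unrelated))
```

In a vector index, this comparison runs approximately over millions of documents rather than exactly over three.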
Sparse Vectors (Learned Sparse Retrieval)
Sparse vectors assign non-zero weights to a vocabulary of tokens, producing a high-dimensional but mostly-zero vector. Unlike dense embeddings, each dimension corresponds to a specific token, making the representation interpretable. BGE-M3 produces sparse vectors alongside dense embeddings in a single forward pass.
Strengths:

- Learned term importance: the model weights tokens by relevance, not just raw frequency.
- Interpretable: each non-zero dimension corresponds to a specific token.
- No extra encoding cost: produced alongside the dense embedding in the same BGE-M3 forward pass.

Weaknesses:

- Requires model inference at query time, unlike BM25.
- Tied to the model's vocabulary and training distribution; quality can degrade on out-of-domain jargon.
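As a sketch, a sparse vector can be held as a token-to-weight map and scored by a dot product over shared tokens. The weights below are invented for illustration; real weights come from the model:

```python
def sparse_dot(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    """Score = sum of weight products over tokens present in both vectors."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

# Hypothetical learned token weights -- note the rare term "gdpr" gets a high weight.
query = {"gdpr": 2.1, "template": 0.8, "agreement": 1.2}
doc_a = {"gdpr": 1.9, "agreement": 1.0, "processing": 0.7}
doc_b = {"privacy": 1.5, "policy": 1.1}

print(sparse_dot(query, doc_a))  # 2.1*1.9 + 1.2*1.0 = 5.19
print(sparse_dot(query, doc_b))  # 0.0 -- no overlapping tokens
```

Because only overlapping tokens contribute, sparse retrieval behaves like keyword search, but with learned rather than statistical term weights.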
BM25 (Best Match 25)
BM25 is a classical probabilistic ranking function that scores documents based on term frequency, inverse document frequency, and document length normalization. It has been the backbone of information retrieval for three decades.
Strengths:

- Exact keyword matching, including rare terms, codes, and names.
- Fast and cheap: no training and no model inference required.
- Well understood after decades of production use.

Weaknesses:

- No notion of semantics: synonyms and paraphrases do not match.
- Scores are unbounded and corpus-dependent, which makes them hard to combine directly with other signals.
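For illustration, a minimal self-contained BM25 scorer (real systems score against an inverted index rather than scanning the corpus):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document for a query.
    k1 saturates term frequency; b controls document-length normalization."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    n = len(corpus)
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))   # rare terms weigh more
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [
    ["gdpr", "data", "processing", "agreement"],
    ["privacy", "policy", "overview"],
    ["sales", "contract", "template"],
]
q = ["gdpr", "agreement"]
scores = [bm25_score(q, doc, corpus) for doc in corpus]  # only the first doc matches
```

The synonym problem is visible here: a query for "gdpr" scores zero against a document that only says "privacy policy", no matter how relevant it is.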
Why Fusion Works
Each retrieval method has complementary strengths and weaknesses. Dense retrieval excels at semantic matching but misses keywords. BM25 excels at keyword matching but misses semantics. Sparse retrieval bridges the gap with learned term importance.
Consider this query: "GDPR data processing agreement template"

- Dense retrieval finds documents about data processing agreements and privacy compliance, but may surface GDPR-adjacent content that never mentions "GDPR" explicitly.
- BM25 insists on the exact tokens, but may rank a document that merely mentions "GDPR" and "template" above a genuinely relevant agreement.
- Sparse retrieval weights "GDPR" heavily as a rare, high-signal term while tolerating variation in the surrounding phrasing.

By fusing all three, you get documents that are both semantically relevant AND contain the right keywords, with learned term importance breaking ties.
Reciprocal Rank Fusion (RRF)
The fusion step is where the magic happens. We use Reciprocal Rank Fusion (RRF), a simple but remarkably effective algorithm introduced by Cormack, Clarke, and Buettcher in 2009.
The RRF score for a document $d$ given multiple ranked lists is:
$$\mathrm{RRF}(d) = \sum_{i} \frac{1}{k + \mathrm{rank}_i(d)}$$

Where:

- $\mathrm{rank}_i(d)$ is the rank of document $d$ in the $i$-th retrieval method's results
- $k$ is a constant (typically 60) that controls how much lower-ranked documents are penalized

Why RRF Over Other Fusion Methods
We evaluated several fusion strategies:
Linear combination -- Normalize scores from each method and take a weighted sum. The problem: scores from different methods are not on comparable scales. Dense cosine similarity (0-1) is fundamentally different from BM25 scores (unbounded). Normalization helps but introduces its own biases.
Learning-to-rank -- Train a model to combine features from each retrieval method. Produces excellent results but requires labeled training data, which most enterprises do not have at deployment time.
RRF -- Uses only rank positions, not scores. This sidesteps the score normalization problem entirely. It requires no training data. And empirically, it performs within 2-3% of learning-to-rank models while being dramatically simpler.
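The whole algorithm fits in a few lines. A minimal sketch, fusing ranked lists of document IDs:

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc IDs. Only rank positions matter -- raw scores
    from the underlying retrievers are never consulted."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy results from three retrievers.
dense  = ["d1", "d2", "d3"]
sparse = ["d2", "d1", "d4"]
bm25   = ["d2", "d5", "d1"]

fused = rrf_fuse([dense, sparse, bm25])
print(fused[0][0])  # "d2" -- first in two lists, second in the third
```

Note that d2 beats d1 even though d1 tops the dense list: consistent strength across all three signals wins over a single first place.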
In our benchmarks, RRF with k=60 consistently outperforms any individual retrieval method:
| Method | NDCG@10 (BEIR avg) | MRR@10 |
|---|---|---|
| Dense only (BGE-M3) | 0.44 | 0.40 |
| Sparse only (BGE-M3) | 0.41 | 0.37 |
| BM25 only | 0.38 | 0.35 |
| Dense + BM25 (RRF) | 0.48 | 0.44 |
| Dense + Sparse + BM25 (RRF) | 0.51 | 0.47 |
The three-signal fusion outperforms the best single method by 15.9% on NDCG@10.
Implementation in Lakehouse42
Here is how hybrid search works in our system, step by step:
Step 1: Query Encoding
When a search query arrives, we encode it simultaneously with BGE-M3 to produce both a dense embedding (1024-dim float vector) and a sparse vector (variable-length token-weight pairs). We also tokenize the query for BM25 scoring.
Step 2: Parallel Retrieval
Three retrieval paths execute concurrently:

- Dense: nearest-neighbor search over the dense embedding index.
- Sparse: scoring against the stored sparse vectors using the learned token weights.
- BM25: lexical scoring of the tokenized query against the document index.

Each path returns its top-N candidates as a ranked list.
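As a sketch, the fan-out can be expressed with a thread pool. The three search functions below are illustrative stand-ins, not Lakehouse42 APIs:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the real index lookups; each returns a ranked list of doc IDs.
def dense_search(query, n=100):  return ["d0", "d1", "d2"]
def sparse_search(query, n=100): return ["s0", "s1", "s2"]
def bm25_search(query, n=100):   return ["b0", "b1", "b2"]

def parallel_retrieve(query: str) -> list[list[str]]:
    """Run all three retrievers concurrently; total latency is roughly
    the slowest path, not the sum of the three."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(fn, query)
                   for fn in (dense_search, sparse_search, bm25_search)]
        return [f.result() for f in futures]

dense_hits, sparse_hits, bm25_hits = parallel_retrieve("gdpr template")
```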
Step 3: Metadata Filtering
Before fusion, all three result sets are filtered by metadata constraints (organization_id, date range, document type, tags, etc.). This filtering happens at the storage layer, not in application code, so it benefits from Iceberg's partition pruning and predicate pushdown.
Step 4: RRF Fusion
The three ranked lists are fused using RRF with k=60. Documents that appear in multiple lists get boosted; documents that rank highly in all three lists rise to the top.
Step 5: Re-ranking (Optional)
For high-precision use cases, we optionally apply a cross-encoder re-ranker to the top-K fused results. The cross-encoder jointly encodes the query and each candidate document, producing more accurate relevance scores at the cost of higher latency (typically +50-100ms).
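A minimal sketch of the re-ranking step, with a toy token-overlap scorer standing in for the cross-encoder forward pass (a real implementation would batch query-document pairs through the model):

```python
def rerank(query: str, candidates: list[str], score_fn, top_k: int = 10) -> list[str]:
    """Re-sort the top-K fused candidates by a (query, doc) relevance score."""
    scored = [(score_fn(query, doc), doc) for doc in candidates[:top_k]]
    return [doc for _, doc in sorted(scored, reverse=True)]

# Toy scorer: shared-token count stands in for a learned relevance model.
def overlap_score(query: str, doc: str) -> int:
    return len(set(query.split()) & set(doc.split()))

docs = ["gdpr processing agreement", "privacy policy overview", "gdpr agreement template"]
reranked = rerank("gdpr data processing agreement template", docs, overlap_score)
```

The cost model is the key design point: the fusion stage scores every candidate independently and cheaply, while the re-ranker spends its expensive joint encoding only on the handful of finalists.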
Tuning Hybrid Search
While RRF works well out of the box, there are several knobs for optimization:
Per-method weight in RRF -- You can weight the three signals differently. For keyword-heavy domains (legal, medical), increasing the BM25 weight improves results. For conceptual search (research, strategy), increasing the dense weight helps.
k parameter -- Lower k values (e.g., 20) amplify the importance of top ranks. Higher k values (e.g., 100) flatten the rank distribution, giving more weight to documents that appear across multiple methods even if they rank lower in each.
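A quick calculation makes the effect of k concrete:

```python
def rrf_contrib(rank: int, k: int) -> float:
    """An individual document's contribution from one ranked list."""
    return 1.0 / (k + rank)

# How much more does rank 1 count than rank 10, for different k?
ratios = {k: rrf_contrib(1, k) / rrf_contrib(10, k) for k in (20, 60, 100)}
for k, r in ratios.items():
    print(f"k={k}: rank 1 counts {r:.2f}x as much as rank 10")
```

At k=20 the top rank counts about 1.43x as much as rank 10; at k=100 the gap shrinks to about 1.09x, so breadth across methods matters more than a single high rank.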
Retrieval depth (N) -- How many candidates to retrieve from each method before fusion. Deeper retrieval (N=200-500) improves recall at the cost of latency. For most workloads, N=100 provides a good balance.
Sparse vector threshold -- Minimum weight for a sparse dimension to be included. Higher thresholds reduce noise but may miss relevant terms. We default to 0.0 (include all non-zero dimensions).
Practical Results
We ran a controlled experiment with an enterprise customer in the legal sector. The corpus contained 50,000 contracts, and the evaluation set was 200 queries with human-labeled relevance judgments.
| Configuration | Precision@10 | Recall@10 | F1@10 | Avg Latency |
|---|---|---|---|---|
| Dense only | 0.62 | 0.45 | 0.52 | 85ms |
| BM25 only | 0.58 | 0.51 | 0.54 | 35ms |
| Dense + BM25 (RRF) | 0.71 | 0.58 | 0.64 | 95ms |
| Dense + Sparse + BM25 (RRF) | 0.74 | 0.63 | 0.68 | 110ms |
| + Cross-encoder re-rank | 0.81 | 0.63 | 0.71 | 195ms |
The full hybrid pipeline with re-ranking achieved 36.5% higher F1 than dense-only retrieval, with latency still well under 200ms.
Conclusion
Hybrid search is not a theoretical improvement -- it is a practical necessity for production retrieval systems. Dense embeddings, sparse vectors, and BM25 each capture different aspects of relevance, and combining them with RRF produces consistently superior results with minimal engineering overhead.
At Lakehouse42, hybrid search is the default for every query. You do not need to choose between semantic and keyword search -- you get both, fused intelligently, on every request.
Want to benchmark hybrid search on your own data? Start a free trial and run your evaluation in under an hour. For enterprise evaluations, contact our team.