Blog · Search Systems

BM25 vs Normal Search

Why counting words is a bad idea — and how BM25 judges quality, not quantity. A beginner-friendly guide with Python code and real-world analogies.

⚖️ Part 2 — BM25 (Smart Ranking)

BM25 (Best Match 25) is a ranking function used in Elasticsearch Apache Lucene Solr and many production search systems. It was first formalised in the 1990s and described in detail in Robertson & Zaragoza (2009) — The Probabilistic Relevance Framework: BM25 and Beyond. Despite being decades old, it remains one of the most effective lexical search algorithms available.

BM25 says: "Let me judge quality, not just quantity." It fixes every problem listed above by adding three ingredients.

The three key ideas

  • Term Frequency Saturation (TF)The first mention of "python" adds a lot. The 5th adds a little. The 20th adds almost nothing. Repetition is controlled — not rewarded infinitely.
  • Document Length NormalisationA short, focused document that mentions "python" twice scores better than a 5000-word document that also mentions it twice. Long documents are penalised proportionally.
  • Inverse Document Frequency (IDF)Rare words (like "BM25") get a high weight. Common words (like "the", "is") get a low weight. Matching a rare word matters more than matching a common one.

Same example — BM25 ranking

Search: "python developer"

🥇
Doc B — BM25 Winner
"python developer roadmap guide"
BM25: high ✅

Balanced usage, clear context, short & relevant.

Doc A — Penalised
"python python python developer"
BM25: medium ⚠️

High frequency but repetitive → saturation kicks in.

Doc C — Low Score
"developer"
BM25: low ❌

Missing "python" entirely.

📐 The BM25 Formula (Plain English)

BM25 Score for a single query term in a document
score(D, q) = IDF(q) × [ (TF(q,D) × (k1 + 1)) / (TF(q,D) + k1 × (1 - b + b × |D|/avgdl)) ] Where: TF(q, D) — how many times query word q appears in document D IDF(q) — how rare the word q is across all documents |D| — length of document D (word count) avgdl — average document length in the collection k1 — controls term frequency saturation (typical: 1.2 – 2.0) b — controls document length penalty (typical: 0.75)
Don't panic. You don't need to memorise this formula. What matters: the score grows with TF but flattens (saturates), shrinks for long documents, and is boosted for rare words (IDF).

What the parameters do

ParameterControlsHigher value means
k1TF saturation speedRepetition matters more before flattening
b = 1.0Full length normalisationLong docs heavily penalised
b = 0.0No length normalisationSame as raw TF (length ignored)
b = 0.75Default balanceModerate length penalty (recommended)

🔍 Head-to-Head Comparison

FeatureNormal SearchBM25
Word countingRaw count onlySaturated (diminishing returns) ✅
Keyword repetitionOver-rewarded ❌Controlled ✅
Document lengthIgnored ❌Normalised ✅
Rare vs common wordsTreated equally ❌IDF weighting ✅
Used in productionOnly in very basic systemsElasticsearch, Lucene, Solr ✅
Works without ML modelsYes ✅Yes ✅
Understands meaningNo (lexical only) ❌No (lexical only) ❌
Note for beginnersNeither normal search nor BM25 understands meaning or synonyms. That's what semantic search (embeddings) solves. BM25 is still great for exact term matching and is often used alongside embeddings in hybrid search.

🎯 Real-Life Analogy

❌ Normal Search

Bad teacher grading an exam:
"You wrote the word 'photosynthesis' 15 times — you get full marks!"

Doesn't matter if the answer makes sense. Just counting keywords.

✅ BM25

Smart teacher grading an exam:
"You explained the concept clearly and concisely, without padding. Rare technical terms used correctly. No unnecessary repetition."

Quality and relevance over raw quantity.

🐍 Python Code — Build It from Scratch

Let's implement a simple version step by step. You do not need any special library for the scratch version — just Python's standard library.

Step 1 — Normal (keyword count) search

python · normal_search.py
# ── Normal Search: just count matching words ──────────────────

def tokenize(text):
    """Lowercase and split into words."""
    return text.lower().split()


def normal_score(document, query_tokens):
    """Count how many query words appear in the document."""
    doc_tokens = tokenize(document)
    score = 0
    for word in query_tokens:
        score += doc_tokens.count(word)   # raw count — no limit
    return score


# ── Example ───────────────────────────────────────────────────
documents = [
    "python python python developer",          # Doc A
    "python developer roadmap guide",           # Doc B
    "developer",                                # Doc C
]

query = "python developer"
query_tokens = tokenize(query)

results = [
    (doc, normal_score(doc, query_tokens))
    for doc in documents
]
results.sort(key=lambda x: x[1], reverse=True)

for rank, (doc, score) in enumerate(results, 1):
    print(f"{rank}. Score={score}  →  {doc}")

## Output:
## 1. Score=4  →  python python python developer   ← spammy doc wins 😤
## 2. Score=2  →  python developer roadmap guide
## 3. Score=1  →  developer

Step 2 — BM25 from scratch

python · bm25_scratch.py
import math
from collections import Counter

# ── BM25 from scratch ─────────────────────────────────────────

def tokenize(text):
    return text.lower().split()


class BM25:
    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1 = k1
        self.b  = b
        self.tokenized_corpus = [tokenize(doc) for doc in corpus]
        self.doc_count  = len(corpus)
        self.avg_dl     = sum(len(d) for d in self.tokenized_corpus) / self.doc_count
        self.tf         = [Counter(d) for d in self.tokenized_corpus]

    def _idf(self, word):
        docs_with_word = sum(1 for tf in self.tf if word in tf)
        if docs_with_word == 0:
            return 0.0
        return math.log(
            (self.doc_count - docs_with_word + 0.5) /
            (docs_with_word + 0.5) + 1
        )

    def score(self, doc_index, query_tokens):
        tf   = self.tf[doc_index]
        dl   = len(self.tokenized_corpus[doc_index])
        total = 0.0
        for word in query_tokens:
            freq = tf.get(word, 0)
            idf  = self._idf(word)
            numerator   = freq * (self.k1 + 1)
            denominator = freq + self.k1 * (1 - self.b + self.b * dl / self.avg_dl)
            total += idf * (numerator / denominator)
        return round(total, 4)

    def rank(self, query):
        tokens  = tokenize(query)
        scores  = [(i, self.score(i, tokens)) for i in range(self.doc_count)]
        return sorted(scores, key=lambda x: x[1], reverse=True)


# ── Example ───────────────────────────────────────────────────
documents = [
    "python python python developer",
    "python developer roadmap guide",
    "developer",
]

bm25    = BM25(documents)
results = bm25.rank("python developer")

for rank, (idx, score) in enumerate(results, 1):
    print(f"{rank}. BM25={score}  →  {documents[idx]}")

## Output:
## 1. BM25=1.0986  →  python developer roadmap guide  ← quality wins 🏆
## 2. BM25=0.8109  →  python python python developer
## 3. BM25=0.4055  →  developer

Step 3 — Use the rank_bm25 library (recommended)

Install once: pip install rank-bm25 · PyPI · GitHub
python · rank_bm25_library.py
from rank_bm25 import BM25Okapi

# Tokenise each document into a list of words
documents = [
    "python python python developer",
    "python developer roadmap guide",
    "developer",
]
tokenized_docs = [doc.split() for doc in documents]

# Build the BM25 index
bm25 = BM25Okapi(tokenized_docs)

# Search
query   = "python developer"
scores  = bm25.get_scores(query.split())

for idx, score in enumerate(scores):
    print(f"Doc {idx}: score={score:.4f}  →  {documents[idx]}")

## Doc 0: score=0.8109  →  python python python developer
## Doc 1: score=1.0986  →  python developer roadmap guide  ← still wins ✅
## Doc 2: score=0.4055  →  developer

# Get top-N results
top_docs = bm25.get_top_n(query.split(), documents, n=2)
print("
Top 2 results:")
for doc in top_docs:
    print(" -", doc)

Step 4 — Use BM25 in a RAG pipeline

python · bm25_rag_example.py
from rank_bm25 import BM25Okapi

# Your knowledge base (replace with real documents)
knowledge_base = [
    "Python is a popular programming language for data science.",
    "BM25 is a ranking function used in information retrieval.",
    "RAG stands for Retrieval-Augmented Generation.",
    "Elasticsearch uses BM25 as its default ranking algorithm.",
    "Vector embeddings capture semantic meaning of text.",
]

def build_bm25_index(docs):
    tokenized = [doc.lower().split() for doc in docs]
    return BM25Okapi(tokenized)

def retrieve(query, docs, bm25_index, top_k=3):
    """Retrieve the top-k most relevant documents for the query."""
    tokens = query.lower().split()
    return bm25_index.get_top_n(tokens, docs, n=top_k)

# Build index once
index = build_bm25_index(knowledge_base)

# Retrieve at query time
user_query   = "How does BM25 work in search?"
top_contexts = retrieve(user_query, knowledge_base, index)

print("Retrieved context for LLM:")
for i, ctx in enumerate(top_contexts, 1):
    print(f"  {i}. {ctx}")

# Then pass top_contexts to your LLM (OpenAI, Ollama, etc.) as context

🎮 Interactive Demo

Type a query and see Normal Search vs BM25 ranking difference live in your browser.

🚀 Why This Matters for AI Developers

You might think: "We have powerful embedding models now, why care about BM25?" Here's why it is still essential:

  • BM25 is inside Elasticsearch and OpenSearch by default. Most production search stacks still use it as the primary retrieval layer.
  • Hybrid search = BM25 + Embeddings. The best RAG systems combine both — BM25 handles exact keyword matches, embeddings handle semantic/conceptual matches. Together they give better recall than either alone.
  • BM25 needs no GPU and no model inference. It is instant, cheap, and scales to millions of documents easily.
  • BM25 works well for rare technical terms. Embeddings sometimes struggle with domain-specific jargon ("BM25Okapi", "HNSW"). BM25 finds exact matches reliably.
Recommended stack for beginners building RAGStart with BM25-only retrieval (rank-bm25). Once working, add embeddings for semantic recall and combine the scores. That hybrid pipeline is what most production systems use.

Learning path

Comments

Ask questions, share feedback, or reply to other readers. Sign in with Google to join the discussion.

Sign in with Google to add comments and replies.

Loading comments...