
BM25

BM25 (Wikipedia), also known as Okapi BM25, is a ranking function used in information retrieval systems to estimate the relevance of documents to a given search query.

BM25Retriever uses the rank_bm25 package.
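
To make the ranking idea concrete, here is a minimal pure-Python sketch of a textbook BM25 score. It is an illustration only: the function name `bm25_scores`, the whitespace tokenization, the parameter defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant are assumptions made for this example, and rank_bm25 may compute its scores somewhat differently.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` (illustrative sketch)."""
    tokenized = [doc.split() for doc in corpus]  # assumption: whitespace tokens
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            # Rare terms get a higher weight; this variant never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by length with b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = ["foo", "bar", "world", "hello", "foo bar"]
scores = bm25_scores("foo", corpus)
```

Under these assumptions, only "foo" and "foo bar" receive a non-zero score for the query "foo", and the shorter document "foo" scores higher because of length normalization.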

%pip install --upgrade --quiet  rank_bm25
from langchain_community.retrievers import BM25Retriever

Create a New Retriever with Texts

retriever = BM25Retriever.from_texts(["foo", "bar", "world", "hello", "foo bar"])

Create a New Retriever with Documents

You can also create a retriever from a list of Document objects.

from langchain_core.documents import Document

retriever = BM25Retriever.from_documents(
    [
        Document(page_content="foo"),
        Document(page_content="bar"),
        Document(page_content="world"),
        Document(page_content="hello"),
        Document(page_content="foo bar"),
    ]
)

Use Retriever

We can now use the retriever!

result = retriever.invoke("foo")
result
[Document(metadata={}, page_content='foo'),
Document(metadata={}, page_content='foo bar'),
Document(metadata={}, page_content='hello'),
Document(metadata={}, page_content='world')]
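
Note that the result above contains four documents even though only two of them contain "foo". One plausible reading is that the retriever returns the k highest-scoring documents regardless of whether their score is zero, with k apparently defaulting to 4 here. The sketch below illustrates that selection step in plain Python; the helper `top_k` and the score values are made up for this example and are not taken from the library.

```python
def top_k(scores, docs, k=4):
    """Return the k documents with the highest scores (illustrative sketch)."""
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:k]]


docs = ["foo", "bar", "world", "hello", "foo bar"]
# Hypothetical scores for the query "foo": only two documents match.
example_scores = [0.9, 0.0, 0.0, 0.0, 0.6]

top_four = top_k(example_scores, docs)       # matches first, then zero-score fillers
top_two = top_k(example_scores, docs, k=2)   # only the two matching documents
```

The order among zero-score documents depends on how ties are broken, so it may differ from the output shown above. Lowering k trims the non-matching documents from the result.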

Preprocessing Function

Pass a custom preprocessing function to the retriever to improve search results. The function receives a string and returns a list of tokens. Word-level tokenization can improve retrieval, especially when the retriever is used with chunked documents of the kind stored in vector stores such as Chroma, Pinecone, or Faiss.

import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer data that word_tokenize relies on.
nltk.download("punkt_tab")

retriever = BM25Retriever.from_documents(
    [
        Document(page_content="foo"),
        Document(page_content="bar"),
        Document(page_content="world"),
        Document(page_content="hello"),
        Document(page_content="foo bar"),
    ],
    k=2,
    preprocess_func=word_tokenize,
)

result = retriever.invoke("bar")
result
[Document(metadata={}, page_content='bar'),
Document(metadata={}, page_content='foo bar')]
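
word_tokenize is only one option. Assuming preprocess_func accepts any callable that maps a string to a list of tokens, a small dependency-free tokenizer could look like the sketch below. The name `simple_tokenize` and its lowercase, alphanumeric-only behavior are choices made for this example, not part of the library.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


tokens = simple_tokenize("Hello, World! Foo-bar.")

# Hypothetical usage, mirroring the word_tokenize example above
# (`docs` stands for your own list of Document objects):
# retriever = BM25Retriever.from_documents(
#     docs, k=2, preprocess_func=simple_tokenize
# )
```

Because this tokenizer lowercases and drops punctuation, a query such as "Foo" would match a document containing "foo." under these assumptions, which plain whitespace splitting would not.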
