bm25s's Issues

On-the-fly stemming

Right now, stemming is done after the strings are split and converted to IDs:

bm25s/bm25s/tokenization.py

Lines 152 to 177 in 73c7dea

# Step 2: Stem the tokens if a stemmer is provided
if stemmer is not None:
    if hasattr(stemmer, "stemWords"):
        stemmer_fn = stemmer.stemWords
    elif callable(stemmer):
        stemmer_fn = stemmer
    else:
        error_msg = "Stemmer must have a `stemWord` method, or be callable. For example, you can use the PyStemmer library."
        raise ValueError(error_msg)
    # Now, we use the stemmer on the token_to_index dictionary to get the stemmed tokens
    tokens_stemmed = stemmer_fn(unique_tokens)
    vocab = set(tokens_stemmed)
    vocab_dict = {token: i for i, token in enumerate(vocab)}
    stem_id_to_stem = {v: k for k, v in vocab_dict.items()}
    # We create a dictionary mapping the stemmed tokens to their index
    doc_id_to_stem_id = {
        token_to_index[token]: vocab_dict[stem]
        for token, stem in zip(unique_tokens, tokens_stemmed)
    }
    # Now, we simply need to replace the tokens in the corpus with the stemmed tokens
    for i, doc_ids in enumerate(tqdm(corpus_ids, desc="Stem Tokens", leave=leave, disable=not show_progress)):
        corpus_ids[i] = [doc_id_to_stem_id[doc_id] for doc_id in doc_ids]
else:
    vocab_dict = token_to_index

However, it can probably be done here instead:

bm25s/bm25s/tokenization.py

Lines 141 to 142 in 73c7dea

if token not in token_to_index:
    token_to_index[token] = len(token_to_index)

Probably would need:

token_to_stem = {}  # do we need this? maybe useful to keep, though stemmer_fn should be sufficient
token_to_index = {}  # this is used to convert tokens to stem id (the true id) on the fly
stem_to_index = {}  # only tracks stems and their ID (this is the true vocab dict)

# example: changing -> chang, changed -> chang
# chang's stem_id = 42
# stem_to_index = {"chang": 42}  --> real vocab_dict
# token_to_index = {"changing": 42, "changed": 42}

# ...

for ...:
  if token not in token_to_index:
    stem = stemmer_fn(token)
    if stem not in stem_to_index:
      stem_to_index[stem] = len(stem_to_index)
    stem_id = stem_to_index[stem]
    token_to_index[token] = stem_id  # the token should now map to the stem's ID
  token_id = token_to_index[token]
# ...
vocab_dict = stem_to_index
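A runnable sketch of the proposal (hypothetical names, not the library's implementation; `toy_stem` is a trivial suffix-stripper standing in for PyStemmer's `stemWords`):

```python
def toy_stem(token):
    # Illustrative stand-in for a real stemmer: strip a common suffix.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokenize_with_inline_stemming(corpus, stemmer_fn):
    token_to_index = {}  # token -> stem id (the true id), filled lazily
    stem_to_index = {}   # only tracks stems and their ID (the true vocab dict)
    corpus_ids = []
    for doc in corpus:
        doc_ids = []
        for token in doc.split():
            if token not in token_to_index:
                stem = stemmer_fn(token)
                if stem not in stem_to_index:
                    stem_to_index[stem] = len(stem_to_index)
                # The token now maps directly to its stem's ID
                token_to_index[token] = stem_to_index[stem]
            doc_ids.append(token_to_index[token])
        corpus_ids.append(doc_ids)
    return corpus_ids, stem_to_index


corpus_ids, vocab_dict = tokenize_with_inline_stemming(
    ["changing changed", "change"], toy_stem
)
# "changing" and "changed" collapse to the same stem id in one pass
```

This avoids the second pass over `corpus_ids` entirely, since stem IDs are assigned the first time each surface token is seen.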

Not Working for langchain Documents

from langchain.docstore.document import Document

doc = Document(page_content="text", metadata={"source": "local"})

Instead of a list of strings, I want to pass a list of LangChain documents, but it's not working.
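A likely workaround, since `bm25s.tokenize` expects plain strings: extract `page_content` before tokenizing, then pass the original `Document` list as `corpus=` at retrieval time so results come back as documents. The `Document` class below is a minimal stand-in with the same attribute names, so the snippet runs without LangChain installed.

```python
from dataclasses import dataclass, field


# Minimal stand-in for langchain's Document (same attribute names),
# used here only to make the pattern self-contained.
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)


docs = [
    Document(page_content="a cat is a feline", metadata={"source": "local"}),
    Document(page_content="a dog loves to play", metadata={"source": "local"}),
]

# bm25s tokenizes plain strings, so pull out page_content before indexing
texts = [doc.page_content for doc in docs]
# corpus_tokens = bm25s.tokenize(texts, stopwords="en", stemmer=stemmer)
# retriever.index(corpus_tokens)
# results, scores = retriever.retrieve(query_tokens, corpus=docs, k=2)
```

With `corpus=docs`, the retrieved entries are the `Document` objects themselves, keeping metadata attached.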

Other language than english for the stopwords list

Thanks for writing this repo.

This project currently ships only English stopwords, but there is the possibility to pass our own list of stopwords, for example for Serbian, French, or Chinese.

def _infer_stopwords(stopwords: Union[str, List[str]]) -> List[str]:
    if stopwords in ["english", "en", True]:
        return STOPWORDS_EN
    elif stopwords in [None, False]:
        return []
    elif isinstance(stopwords, str):
        raise ValueError(
            f"{stopwords} not recognized. Only default English stopwords are currently supported. "
            "Please input a list of stopwords"
        )
    else:
        return stopwords


Could we add stopword lists for other languages to the repo by opening a PR, or do you plan to incorporate them, or not at all?

I could be open to adding other languages :)

Thread safe search

Amazing work on this! Works great.

Is retrieval thread-safe? On a glance, it seems like it should be, but I have trouble using multi-threading in a notebook. It crashes most of the time, but when it works the results are correct.

I should add that I have trouble regardless of whether the backend is jax or numpy.
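Until thread safety is confirmed, one conservative workaround is to serialize calls behind a lock, trading parallelism for correctness. This is a sketch, not a bm25s API; `retrieve_safe` is a hypothetical helper around any retriever object:

```python
import threading

# While thread safety is unconfirmed, funnel all retrieve() calls
# through a single lock so only one runs at a time.
_retrieve_lock = threading.Lock()


def retrieve_safe(retriever, query_tokens, k=10):
    # `retriever` is assumed to be a bm25s.BM25 instance (or anything
    # with a compatible retrieve() method).
    with _retrieve_lock:
        return retriever.retrieve(query_tokens, k=k)
```

Worker threads can then call `retrieve_safe` freely; tokenization, which is pure Python over local data, can stay outside the lock.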

Can the index be updated incrementally?

As the title says. In practice, many scenarios require real-time incremental updates.

If possible, I would suggest taking whoosh, tantivy, and similar projects as references and packaging this as a reasonably complete low-level full-text search library.

Updating an index for batch indexing

Hi! Is it possible to update an existing index, e.g. for batch indexing and larger-than-memory datasets?
Unfortunately, this does not work:

import Stemmer
import bm25s

batch_0 = [
    {"id": 0, "text": "a cat is a feline and likes to purr"},
    {"id": 1, "text": "a dog is the human's best friend and loves to play"},
]

batch_1 = [
    {"id": 2, "text": "a bird is a beautiful animal that can fly"},
    {"id": 3, "text": "a fish is a creature that lives in water and swims"},
]


def index_corpus(corpus_batch):
    corpus_tokens = bm25s.tokenize([d['text'] for d in corpus_batch], stopwords="en", stemmer=stemmer)
    retriever.index(corpus_tokens)


def query_corpus(query):
    query_tokens = bm25s.tokenize(query, stemmer=stemmer)

    all_batches = batch_0 + batch_1
    results, scores = retriever.retrieve(query_tokens, corpus=all_batches, k=2)

    for i in range(results.shape[1]):
        doc, score = results[0, i], scores[0, i]
        print(f"Rank {i + 1} (score: {score:.2f}): {doc}")


retriever = bm25s.BM25()
stemmer = Stemmer.Stemmer("english")

index_corpus(batch_0)
index_corpus(batch_1)

query_corpus("what is a fish?")
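Until incremental updates are supported, one workaround consistent with the current API is to accumulate batches and rebuild the index over the combined corpus after each batch, since each `index()` call appears to replace the previous index rather than extend it. A sketch, with the bm25s calls commented out and `add_batch` as a hypothetical helper:

```python
batch_0 = [
    {"id": 0, "text": "a cat is a feline and likes to purr"},
    {"id": 1, "text": "a dog is the human's best friend and loves to play"},
]
batch_1 = [
    {"id": 2, "text": "a bird is a beautiful animal that can fly"},
    {"id": 3, "text": "a fish is a creature that lives in water and swims"},
]

all_docs = []


def add_batch(batch):
    # Accumulate the batch, then rebuild the index over everything seen so far.
    all_docs.extend(batch)
    texts = [d["text"] for d in all_docs]
    # corpus_tokens = bm25s.tokenize(texts, stopwords="en", stemmer=stemmer)
    # retriever.index(corpus_tokens)  # replaces the previous index
    return texts


texts = add_batch(batch_0)
texts = add_batch(batch_1)  # now covers all four documents
```

This obviously re-tokenizes older batches, so it only defers the problem for truly larger-than-memory datasets, but it gives correct results across batches with today's API.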

Can you query without a tokenization step?

In the case I have an index of the queries, I would like to retrieve the tokenized version of that query. This use case can come up when doing bm25 eval across a matrix of known x and y types of objects.

x_corpus = [...]

y_corpus = [
    "fooo",
    "does the fish purr like a cat?",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

class XEntity:
  corpus_tokens = bm25s.tokenize(x_corpus, stopwords="en", stemmer=stemmer)
  x_retriever = bm25s.BM25()
  x_retriever.index(corpus_tokens)

class YEntity:
  corpus_tokens = bm25s.tokenize(y_corpus, stopwords="en", stemmer=stemmer)
  y_retriever = bm25s.BM25()
  y_retriever.index(corpus_tokens)

corpus_tokens for y_corpus

Tokenized(ids=[[8], [10, 7, 9, 2, 11, 4, 13, 6, 5, 12], [7, 0, 3, 14, 1]], vocab={'creatur': 0, 'swim': 1, 'like': 2, 'live': 3, 'bird': 4, 'can': 5, 'anim': 6, 'fish': 7, 'fooo': 8, 'purr': 9, 'doe': 10, 'cat': 11, 'fli': 12, 'beauti': 13, 'water': 14})

If I tokenize independently...

a_query = "does the fish purr like a cat?"
Tokenized(ids=[[3, 1, 2, 0, 4]], vocab={'like': 0, 'fish': 1, 'purr': 2, 'doe': 3, 'cat': 4})

Given the results when tokenizing the index (it looks like some optimization is happening), is there a way to get a subset of the index representation that corresponds to the query as it was represented when the index was built?

query_from_y = precomputed_representation_of_a_query_without_tokenization_step
ranked_results = x_retriever.retrieve(query_from_y, k=5)
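One way to get this today, sketched with plain dict lookups rather than any bm25s API: keep the `vocab` that `bm25s.tokenize` returned when the index was built, and map a query's already-stemmed tokens through it, dropping out-of-vocabulary tokens (which could not contribute to the score anyway). `query_ids_from_index_vocab` is a hypothetical helper:

```python
# Vocab produced when y_corpus was tokenized (copied from the issue above)
index_vocab = {'creatur': 0, 'swim': 1, 'like': 2, 'live': 3, 'bird': 4,
               'can': 5, 'anim': 6, 'fish': 7, 'fooo': 8, 'purr': 9,
               'doe': 10, 'cat': 11, 'fli': 12, 'beauti': 13, 'water': 14}


def query_ids_from_index_vocab(stemmed_tokens, vocab):
    # Map pre-stemmed tokens to the ids assigned at index-build time,
    # silently dropping tokens the index has never seen.
    return [vocab[t] for t in stemmed_tokens if t in vocab]


# Already-stemmed tokens for "does the fish purr like a cat?"
ids = query_ids_from_index_vocab(['doe', 'fish', 'purr', 'like', 'cat'], index_vocab)
```

The resulting ids live in the same id space as the index, so the independent per-query vocab from `bm25s.tokenize` is never needed.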

🚨Before submitting an issue, read this 🚨

There are many reasons you might want to open an issue:

  • You have a question about how to use the library
  • You have an idea how the library could be improved, and would like to discuss it
  • You would like to highlight a general discussion
  • You found a bug
  • You would like to outline a new feature to add to the library

Please only open an issue for the latter two cases: a bug (report or fix) or a concrete new feature. Everything else will be moved to Discussions.

Minor bug: `show_progress` not propagated in `BM25.index`

Just a minor bug: the show_progress setting is not propagated from BM25.index to BM25.build_index_from_ids. Currently, __init__.py lines 352-356 read:

            scores = self.build_index_from_ids(
                unique_token_ids=unique_token_ids,
                corpus_token_ids=corpus_token_ids,
                leave_progress=leave_progress
            )

should be

            scores = self.build_index_from_ids(
                unique_token_ids=unique_token_ids,
                corpus_token_ids=corpus_token_ids,
                leave_progress=leave_progress,
                show_progress=show_progress
            )

Using with postgres?

I'm using Supabase for one of my projects, and it has about 2.3M rows. Currently the data is only fetched using certain attributes, as Full Text Search is pretty slow. Is there any way we can use bm25s with the existing infrastructure?

Thanks for your response.

[Feature request] Document metadata and filtering

Hi, I've used bm25s on a fairly large production dataset, and I'm super-impressed by the speed!! Having fumbled around with rank_bm25 quite a bit and suffered through the pain of its slow speed and large memory usage, I would say the speed and memory efficiency of bm25s is absolutely mind-blowing.

As a suggestion, I think it might be useful to add support for document metadata and filtering. The metadata would be fields like "author", "title", "date", etc. which wouldn't be included in the keyword tokenization, but can be used for filtering the search results during query (e.g. only searching for documents from a specific author).

Thanks!

Order-based matching of corpus metadata to tokens

Hi! Thanks a lot for this nice little library, the timing is perfect :)

If I want to provide additional metadata in my corpus, how is it matched to the indexed corpus tokens at retrieval time? Is it entirely based on both structures having the same order such that the indices apply?

Just looking for a quick confirmation before using this in a real-world application :)

Quick example to illustrate:

import bm25s
import Stemmer

# corpus with metadata
corpus = [
    {"id": 0, "text": "a cat is a feline and likes to purr"},
    {"id": 1, "text": "a dog is the human's best friend and loves to play"},
    {"id": 2, "text": "a bird is a beautiful animal that can fly"},
    {"id": 3, "text": "a fish is a creature that lives in water and swims"},
]

stemmer = Stemmer.Stemmer("english")

# build corpus without metadata
corpus_tokens = bm25s.tokenize([d['text'] for d in corpus], stopwords="en", stemmer=stemmer)
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)

results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)

for i in range(results.shape[1]):

    # doc is a dictionary with "id" and "text" - how are they matched?
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i + 1} (score: {score:.2f}): {doc}")

[Feature Request] Support attaching metadata to the corpus

It can be very helpful to attach metadata to a corpus, that is not indexed, but still returned during retrieval.

For example, a super naive approach:

corpus = [
  {"text": "Hello world", "metadata": {"source": "internet"}},
  ...
]

The main motivation for me is providing a more first-class integration in llama-index 😄 I can serialize the entire TextNode object to make saving/loading very smooth. But I think overall this would be a super valuable feature

Pre-computed TF-IDF

Is it possible to pass a pre-computed TF-IDF matrix (with the shape [documents, vocabulary])?

Capability Inquiry: Retrieving Specific JSON Records Based on Text

Hi, I am considering using the bm25s library for a project where I need to efficiently retrieve JSON records based on textual content matches. My data is structured as JSON records, each with several fields.

Use Case

When I input a query, such as "mountain cycling", I want to retrieve the top K JSON records that best match this query based on the content of the 'chunk' field.

Example JSON record:

    {
        "chunk_id": 1,
        "chunk": "mountain cycling",
        "vocabulary_id": "SPORTS001",
        "vocabulary_name": "Global Sports Vocabulary",
        "concept_code": "MTCYCL001",
        "concept_name": "Mountain Cycling",
        "domain": "Outdoor Sports",
        "validity": true,
        "source": "Sports Encyclopedia"
    },

Questions

  1. Does the BM25 library support indexing and retrieving directly from JSON structures like the ones provided above, particularly focusing on a specific field for text matching?

  2. Setup Advice: If direct JSON handling is supported, could you provide guidance or documentation on how to set up the library for this specific use case?
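Regarding question 1: bm25s indexes tokenized text rather than JSON, so a workable pattern (mirroring the metadata examples in the issues above) is to extract the 'chunk' field for tokenization and pass the full records as `corpus=` at retrieval time, so the top-K results come back as complete JSON records. A sketch with the bm25s calls commented out:

```python
import json

# Records as they might arrive from a JSON file (fields from the example above)
records_json = """[
  {"chunk_id": 1, "chunk": "mountain cycling", "domain": "Outdoor Sports"},
  {"chunk_id": 2, "chunk": "road running", "domain": "Outdoor Sports"}
]"""
records = json.loads(records_json)

# Only the 'chunk' field participates in text matching
chunks = [r["chunk"] for r in records]
# corpus_tokens = bm25s.tokenize(chunks, stopwords="en", stemmer=stemmer)
# retriever.index(corpus_tokens)
# query_tokens = bm25s.tokenize("mountain cycling", stemmer=stemmer)
# results, scores = retriever.retrieve(query_tokens, corpus=records, k=2)
```

The other fields (`vocabulary_id`, `concept_code`, etc.) ride along untouched and are returned with each hit, since matching is position-based between the tokenized chunks and the records list.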
