bm25s's Issues

On-the-fly stemming

Right now, stemming is done after the strings are split and converted to IDs:

bm25s/bm25s/tokenization.py

Lines 152 to 177 in 73c7dea

# Step 2: Stem the tokens if a stemmer is provided
if stemmer is not None:
    if hasattr(stemmer, "stemWords"):
        stemmer_fn = stemmer.stemWords
    elif callable(stemmer):
        stemmer_fn = stemmer
    else:
        error_msg = "Stemmer must have a `stemWord` method, or be callable. For example, you can use the PyStemmer library."
        raise ValueError(error_msg)
    # Now, we use the stemmer on the token_to_index dictionary to get the stemmed tokens
    tokens_stemmed = stemmer_fn(unique_tokens)
    vocab = set(tokens_stemmed)
    vocab_dict = {token: i for i, token in enumerate(vocab)}
    stem_id_to_stem = {v: k for k, v in vocab_dict.items()}
    # We create a dictionary mapping the stemmed tokens to their index
    doc_id_to_stem_id = {
        token_to_index[token]: vocab_dict[stem]
        for token, stem in zip(unique_tokens, tokens_stemmed)
    }
    # Now, we simply need to replace the tokens in the corpus with the stemmed tokens
    for i, doc_ids in enumerate(tqdm(corpus_ids, desc="Stem Tokens", leave=leave, disable=not show_progress)):
        corpus_ids[i] = [doc_id_to_stem_id[doc_id] for doc_id in doc_ids]
else:
    vocab_dict = token_to_index

However, it can probably be done here instead:

bm25s/bm25s/tokenization.py

Lines 141 to 142 in 73c7dea

if token not in token_to_index:
    token_to_index[token] = len(token_to_index)

Probably would need:

token_to_stem = {}  # do we need this? maybe useful to keep, though stemmer_fn should be sufficient
token_to_index = {}  # this is used to convert tokens to stem id (the true id) on the fly
stem_to_index = {}  # only tracks stems and their ID (this is the true vocab dict)

# example: changing -> chang, changed -> chang
# chang's stem_id = 42
# stem_to_index = {"chang": 42}  --> real vocab_dict
# token_to_index = {"changing": 42, "changed": 42}

# ...

for ...:
  if token not in token_to_index:
    stem = stemmer_fn(token)
    if stem not in stem_to_index:
      stem_to_index[stem] = len(stem_to_index)
    stem_id = stem_to_index[stem]
    token_to_index[token] = stem_id  # the token should now map to the stem's ID
  token_id = token_to_index[token]
# ...
vocab_dict = stem_to_index
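A runnable sketch of the proposal (hypothetical names, not the library's implementation; `toy_stem` is a trivial suffix-stripper standing in for PyStemmer's `stemWords`):

```python
def toy_stem(token):
    # Illustrative stand-in for a real stemmer: strip a common suffix.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokenize_with_inline_stemming(corpus, stemmer_fn):
    token_to_index = {}  # token -> stem id (the true id), filled lazily
    stem_to_index = {}   # only tracks stems and their ID (the true vocab dict)
    corpus_ids = []
    for doc in corpus:
        doc_ids = []
        for token in doc.split():
            if token not in token_to_index:
                stem = stemmer_fn(token)
                if stem not in stem_to_index:
                    stem_to_index[stem] = len(stem_to_index)
                # The token now maps directly to its stem's ID
                token_to_index[token] = stem_to_index[stem]
            doc_ids.append(token_to_index[token])
        corpus_ids.append(doc_ids)
    return corpus_ids, stem_to_index


corpus_ids, vocab_dict = tokenize_with_inline_stemming(
    ["changing changed", "change"], toy_stem
)
# "changing" and "changed" collapse to the same stem id in one pass
```

This avoids the second pass over `corpus_ids` entirely, since stem IDs are assigned the first time each surface token is seen.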

Not Working for langchain Documents

from langchain.docstore.document import Document

doc = Document(page_content="text", metadata={"source": "local"})

Instead of a list of strings, I want to pass a list of LangChain documents, but it's not working.
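A likely workaround, since `bm25s.tokenize` expects plain strings: extract `page_content` before tokenizing, then pass the original `Document` list as `corpus=` at retrieval time so results come back as documents. The `Document` class below is a minimal stand-in with the same attribute names, so the snippet runs without LangChain installed.

```python
from dataclasses import dataclass, field


# Minimal stand-in for langchain's Document (same attribute names),
# used here only to make the pattern self-contained.
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)


docs = [
    Document(page_content="a cat is a feline", metadata={"source": "local"}),
    Document(page_content="a dog loves to play", metadata={"source": "local"}),
]

# bm25s tokenizes plain strings, so pull out page_content before indexing
texts = [doc.page_content for doc in docs]
# corpus_tokens = bm25s.tokenize(texts, stopwords="en", stemmer=stemmer)
# retriever.index(corpus_tokens)
# results, scores = retriever.retrieve(query_tokens, corpus=docs, k=2)
```

With `corpus=docs`, the retrieved entries are the `Document` objects themselves, keeping metadata attached.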

Other language than english for the stopwords list

Thanks for writing this repo.

This project currently ships only English stopwords, but there is the possibility to pass our own list of stopwords, for example for Serbian, French, or Chinese.

def _infer_stopwords(stopwords: Union[str, List[str]]) -> List[str]:
    if stopwords in ["english", "en", True]:
        return STOPWORDS_EN
    elif stopwords in [None, False]:
        return []
    elif isinstance(stopwords, str):
        raise ValueError(
            f"{stopwords} not recognized. Only default English stopwords are currently supported. "
            "Please input a list of stopwords"
        )
    else:
        return stopwords


Could we add stopword lists for other languages to the repo by opening a PR, or do you plan to incorporate them, or not at all?

I could be open to adding other languages :)

Thread safe search

Amazing work on this! Works great.

Is retrieval thread-safe? On a glance, it seems like it should be, but I have trouble using multi-threading in a notebook. It crashes most of the time, but when it works the results are correct.

I should add that I have trouble regardless of whether the backend is jax or numpy.
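Until thread safety is confirmed, one conservative workaround is to serialize calls behind a lock, trading parallelism for correctness. This is a sketch, not a bm25s API; `retrieve_safe` is a hypothetical helper around any retriever object:

```python
import threading

# While thread safety is unconfirmed, funnel all retrieve() calls
# through a single lock so only one runs at a time.
_retrieve_lock = threading.Lock()


def retrieve_safe(retriever, query_tokens, k=10):
    # `retriever` is assumed to be a bm25s.BM25 instance (or anything
    # with a compatible retrieve() method).
    with _retrieve_lock:
        return retriever.retrieve(query_tokens, k=k)
```

Worker threads can then call `retrieve_safe` freely; tokenization, which is pure Python over local data, can stay outside the lock.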

Can the index be updated incrementally?

As the title says. In practice, many scenarios require real-time incremental updates.

If possible, I would suggest taking whoosh, tantivy, and similar projects as references and packaging this as a reasonably complete low-level full-text search library.

Updating an index for batch indexing

Hi! Is it possible to update an existing index, e.g. for batch indexing and larger-than-memory datasets?
Unfortunately, this does not work:

import Stemmer
import bm25s

batch_0 = [
    {"id": 0, "text": "a cat is a feline and likes to purr"},
    {"id": 1, "text": "a dog is the human's best friend and loves to play"},
]

batch_1 = [
    {"id": 2, "text": "a bird is a beautiful animal that can fly"},
    {"id": 3, "text": "a fish is a creature that lives in water and swims"},
]


def index_corpus(corpus_batch):
    corpus_tokens = bm25s.tokenize([d['text'] for d in corpus_batch], stopwords="en", stemmer=stemmer)
    retriever.index(corpus_tokens)


def query_corpus(query):
    query_tokens = bm25s.tokenize(query, stemmer=stemmer)

    all_batches = batch_0 + batch_1
    results, scores = retriever.retrieve(query_tokens, corpus=all_batches, k=2)

    for i in range(results.shape[1]):
        doc, score = results[0, i], scores[0, i]
        print(f"Rank {i + 1} (score: {score:.2f}): {doc}")


retriever = bm25s.BM25()
stemmer = Stemmer.Stemmer("english")

index_corpus(batch_0)
index_corpus(batch_1)

query_corpus("what is a fish?")
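Until incremental updates are supported, one workaround consistent with the current API is to accumulate batches and rebuild the index over the combined corpus after each batch, since each `index()` call appears to replace the previous index rather than extend it. A sketch, with the bm25s calls commented out and `add_batch` as a hypothetical helper:

```python
batch_0 = [
    {"id": 0, "text": "a cat is a feline and likes to purr"},
    {"id": 1, "text": "a dog is the human's best friend and loves to play"},
]
batch_1 = [
    {"id": 2, "text": "a bird is a beautiful animal that can fly"},
    {"id": 3, "text": "a fish is a creature that lives in water and swims"},
]

all_docs = []


def add_batch(batch):
    # Accumulate the batch, then rebuild the index over everything seen so far.
    all_docs.extend(batch)
    texts = [d["text"] for d in all_docs]
    # corpus_tokens = bm25s.tokenize(texts, stopwords="en", stemmer=stemmer)
    # retriever.index(corpus_tokens)  # replaces the previous index
    return texts


texts = add_batch(batch_0)
texts = add_batch(batch_1)  # now covers all four documents
```

This obviously re-tokenizes older batches, so it only defers the problem for truly larger-than-memory datasets, but it gives correct results across batches with today's API.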

Can you query without a tokenization step?

In the case I have an index of the queries, I would like to retrieve the tokenized version of that query. This use case can come up when doing bm25 eval across a matrix of known x and y types of objects.

x_corpus = [...]

y_corpus = [
    "fooo",
    "does the fish purr like a cat?",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

class XEntity:
  corpus_tokens = bm25s.tokenize(x_corpus, stopwords="en", stemmer=stemmer)
  x_retriever = bm25s.BM25()
  x_retriever.index(corpus_tokens)

class YEntity:
  corpus_tokens = bm25s.tokenize(y_corpus, stopwords="en", stemmer=stemmer)
  y_retriever = bm25s.BM25()
  y_retriever.index(corpus_tokens)

corpus_tokens for y_corpus

Tokenized(ids=[[8], [10, 7, 9, 2, 11, 4, 13, 6, 5, 12], [7, 0, 3, 14, 1]], vocab={'creatur': 0, 'swim': 1, 'like': 2, 'live': 3, 'bird': 4, 'can': 5, 'anim': 6, 'fish': 7, 'fooo': 8, 'purr': 9, 'doe': 10, 'cat': 11, 'fli': 12, 'beauti': 13, 'water': 14})

If I tokenize independently...

a_query = "does the fish purr like a cat?"
Tokenized(ids=[[3, 1, 2, 0, 4]], vocab={'like': 0, 'fish': 1, 'purr': 2, 'doe': 3, 'cat': 4})

Given the results when tokenizing the index (it looks like some optimization is happening), is there a way to get a subset of the index representation that corresponds to the query as it was represented when the index was built?

query_from_y = precomputed_representation_of_a_query_without_tokenization_step
ranked_results = x_retriever.retrieve(query_from_y, k=5)
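One way to get this today, sketched with plain dict lookups rather than any bm25s API: keep the `vocab` that `bm25s.tokenize` returned when the index was built, and map a query's already-stemmed tokens through it, dropping out-of-vocabulary tokens (which could not contribute to the score anyway). `query_ids_from_index_vocab` is a hypothetical helper:

```python
# Vocab produced when y_corpus was tokenized (copied from the issue above)
index_vocab = {'creatur': 0, 'swim': 1, 'like': 2, 'live': 3, 'bird': 4,
               'can': 5, 'anim': 6, 'fish': 7, 'fooo': 8, 'purr': 9,
               'doe': 10, 'cat': 11, 'fli': 12, 'beauti': 13, 'water': 14}


def query_ids_from_index_vocab(stemmed_tokens, vocab):
    # Map pre-stemmed tokens to the ids assigned at index-build time,
    # silently dropping tokens the index has never seen.
    return [vocab[t] for t in stemmed_tokens if t in vocab]


# Already-stemmed tokens for "does the fish purr like a cat?"
ids = query_ids_from_index_vocab(['doe', 'fish', 'purr', 'like', 'cat'], index_vocab)
```

The resulting ids live in the same id space as the index, so the independent per-query vocab from `bm25s.tokenize` is never needed.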

🚨Before submitting an issue, read this 🚨

There are many reasons you might want to open an issue:

  • You have a question about how to use the library
  • You have an idea how the library could be improved, and would like to discuss it
  • You would like to highlight a general discussion
  • You found a bug
  • You would like to outline a new feature to add to the library

Please only open an issue for the latter two cases: a bug (report or fix) or a concrete new feature. Everything else will be moved to Discussions.

Minor bug: `show_progress` not propagated in `BM25.index`

Just a minor bug: the show_progress setting is not propagated from BM25.index to BM25.build_index_from_ids. Currently, __init__.py lines 352-356 read:

            scores = self.build_index_from_ids(
                unique_token_ids=unique_token_ids,
                corpus_token_ids=corpus_token_ids,
                leave_progress=leave_progress
            )

should be

            scores = self.build_index_from_ids(
                unique_token_ids=unique_token_ids,
                corpus_token_ids=corpus_token_ids,
                leave_progress=leave_progress,
                show_progress=show_progress
            )

Using with postgres?

I'm using Supabase for one of my projects, and it has about 2.3M rows. Currently the data is only fetched using certain attributes, as Full Text Search is pretty slow. Is there any way we can use bm25s with the existing infrastructure?

Thanks for your response.

[Feature request] Document metadata and filtering

Hi, I've used bm25s on a fairly large production dataset, and I'm super-impressed by the speed!! Having fumbled around with rank_bm25 quite a bit and suffered through the pain of its slow speed and large memory usage, I would say the speed and memory efficiency of bm25s is absolutely mind-blowing.

As a suggestion, I think it might be useful to add support for document metadata and filtering. The metadata would be fields like "author", "title", "date", etc. which wouldn't be included in the keyword tokenization, but can be used for filtering the search results during query (e.g. only searching for documents from a specific author).

Thanks!

Order-based matching of corpus metadata to tokens

Hi! Thanks a lot for this nice little library, the timing is perfect :)

If I want to provide additional metadata in my corpus, how is it matched to the indexed corpus tokens at retrieval time? Is it entirely based on both structures having the same order such that the indices apply?

Just looking for a quick confirmation before using this in a real-world application :)

Quick example to illustrate:

import bm25s
import Stemmer

# corpus with metadata
corpus = [
    {"id": 0, "text": "a cat is a feline and likes to purr"},
    {"id": 1, "text": "a dog is the human's best friend and loves to play"},
    {"id": 2, "text": "a bird is a beautiful animal that can fly"},
    {"id": 3, "text": "a fish is a creature that lives in water and swims"},
]

stemmer = Stemmer.Stemmer("english")

# build corpus without metadata
corpus_tokens = bm25s.tokenize([d['text'] for d in corpus], stopwords="en", stemmer=stemmer)
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)

results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)

for i in range(results.shape[1]):

    # doc is a dictionary with "id" and "text" - how are they matched?
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i + 1} (score: {score:.2f}): {doc}")

[Feature Request] Support attaching metadata to the corpus

It can be very helpful to attach metadata to a corpus, that is not indexed, but still returned during retrieval.

For example, a super naive approach:

corpus = [
  {"text": "Hello world", "metadata": {"source": "internet"}},
  ...
]

The main motivation for me is providing a more first-class integration in llama-index 😄 I can serialize the entire TextNode object to make saving/loading very smooth. But I think overall this would be a super valuable feature

Pre-computed TF-IDF

Is it possible to pass a pre-computed TF-IDF matrix (with the shape [documents, vocabulary])?

Capability Inquiry: Retrieving Specific JSON Records Based on Text

Hi, I am considering using the bm25s library for a project where I need to efficiently retrieve JSON records based on textual content matches. My data is structured as JSON records, each with several fields.

Use Case

When I input a query, such as "mountain cycling", I want to retrieve the top K JSON records that best match this query based on the content of the 'chunk' field.

Example JSON record:

    {
        "chunk_id": 1,
        "chunk": "mountain cycling",
        "vocabulary_id": "SPORTS001",
        "vocabulary_name": "Global Sports Vocabulary",
        "concept_code": "MTCYCL001",
        "concept_name": "Mountain Cycling",
        "domain": "Outdoor Sports",
        "validity": true,
        "source": "Sports Encyclopedia"
    },

Questions

  1. Does the BM25 library support indexing and retrieving directly from JSON structures like the ones provided above, particularly focusing on a specific field for text matching?

  2. Setup Advice: If direct JSON handling is supported, could you provide guidance or documentation on how to set up the library for this specific use case?
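Regarding question 1: bm25s indexes tokenized text rather than JSON, so a workable pattern (mirroring the metadata examples in the issues above) is to extract the 'chunk' field for tokenization and pass the full records as `corpus=` at retrieval time, so the top-K results come back as complete JSON records. A sketch with the bm25s calls commented out:

```python
import json

# Records as they might arrive from a JSON file (fields from the example above)
records_json = """[
  {"chunk_id": 1, "chunk": "mountain cycling", "domain": "Outdoor Sports"},
  {"chunk_id": 2, "chunk": "road running", "domain": "Outdoor Sports"}
]"""
records = json.loads(records_json)

# Only the 'chunk' field participates in text matching
chunks = [r["chunk"] for r in records]
# corpus_tokens = bm25s.tokenize(chunks, stopwords="en", stemmer=stemmer)
# retriever.index(corpus_tokens)
# query_tokens = bm25s.tokenize("mountain cycling", stemmer=stemmer)
# results, scores = retriever.retrieve(query_tokens, corpus=records, k=2)
```

The other fields (`vocabulary_id`, `concept_code`, etc.) ride along untouched and are returned with each hit, since matching is position-based between the tokenized chunks and the records list.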
