
retriv's People

Contributors

alex2awesome, amenra


retriv's Issues

Error while generating package metadata

× Encountered error while generating package metadata.
╰─> See below for output.

(venv) celso@capri:~$ pip install retriv
Collecting retriv
  Using cached retriv-0.1.4-py3-none-any.whl (20 kB)
Requirement already satisfied: numpy in ./projects/venvs/venv/lib/python3.10/site-packages (from retriv) (1.22.4)
Collecting optuna
  Using cached optuna-3.0.5-py3-none-any.whl (348 kB)
Collecting indxr
  Using cached indxr-0.1.1-py3-none-any.whl (8.7 kB)
Collecting cyhunspell
  Using cached CyHunspell-1.3.4.tar.gz (2.7 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [27 lines of output]
      Downloading https://github.com/hunspell/hunspell/archive/v1.6.2.tar.gz to /tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/external/v1.6.2.tar.gz
      Extracting /tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/external/v1.6.2.tar.gz to /tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/external
      Traceback (most recent call last):
        File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/find_library.py", line 226, in pkgconfig
          raise RuntimeError(response)
      RuntimeError: /bin/sh: 1: pkg-config: not found
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/setup.py", line 46, in <module>
          hunspell_config = pkgconfig('hunspell', language='c++')
        File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/find_library.py", line 259, in pkgconfig
          lib_path = build_hunspell_package(os.path.join(BASE_DIR, 'external', 'hunspell-1.6.2'))
        File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/find_library.py", line 189, in build_hunspell_package
          check_call(['autoreconf', '-vfi'])
        File "/usr/lib/python3.10/subprocess.py", line 364, in check_call
          retcode = call(*popenargs, **kwargs)
        File "/usr/lib/python3.10/subprocess.py", line 345, in call
          with Popen(*popenargs, **kwargs) as p:
        File "/usr/lib/python3.10/subprocess.py", line 969, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "/usr/lib/python3.10/subprocess.py", line 1845, in _execute_child
          raise child_exception_type(errno_num, err_msg, err_filename)
      FileNotFoundError: [Errno 2] No such file or directory: 'autoreconf'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
(venv) celso@capri:~$ 

fsspec==2023.12.2 does not allow '**' in path

  File "/lib/python3.8/site-packages/retriv/dense_retriever/dense_retriever.py", line 175, in index
    self.index_aux(
  File "/lib/python3.8/site-packages/retriv/dense_retriever/dense_retriever.py", line 137, in index_aux
    self.ann_searcher.build()
  File "/lib/python3.8/site-packages/retriv/dense_retriever/ann_searcher.py", line 27, in build
    index, index_infos = build_index(
  File "/lib/python3.8/site-packages/autofaiss/external/quantize.py", line 205, in build_index
    embedding_reader = EmbeddingReader(
  File "/lib/python3.8/site-packages/embedding_reader/embedding_reader.py", line 20, in __init__
    self.reader = NumpyReader(embeddings_folder)
  File "/lib/python3.8/site-packages/embedding_reader/numpy_reader.py", line 67, in __init__
    self.fs, embeddings_file_paths = get_file_list(embeddings_folder, "npy")
  File "/lib/python3.8/site-packages/embedding_reader/get_file_list.py", line 15, in get_file_list
    return _get_file_list(path, file_format)
  File "/lib/python3.8/site-packages/embedding_reader/get_file_list.py", line 46, in _get_file_list
    file_paths = fs.glob(glob_pattern)
  File "/lib/python3.8/site-packages/fsspec/spec.py", line 606, in glob
    pattern = glob_translate(path + ("/" if ends_with_sep else ""))
  File "/lib/python3.8/site-packages/fsspec/utils.py", line 734, in glob_translate
    raise ValueError(
ValueError: Invalid pattern: '**' can only be an entire path component

This error does not occur with fsspec==2023.5.0, so pinning fsspec to that version works around the problem.

[Feature Request] Use WAND Top-K Retrieval

@inproceedings{petri2013exploring,
  title={Exploring the magic of WAND},
  author={Petri, Matthias and Culpepper, J Shane and Moffat, Alistair},
  booktitle={Proceedings of the 18th Australasian Document Computing Symposium},
  pages={58--65},
  year={2013}
}

I believe that if you're using an inverted index with per-token document lists, the WAND top-k retrieval algorithm can speed up retrieval for small k over large collections. I'm not sure whether it's relevant to this project. I once implemented it here: https://raw.githubusercontent.com/hockyy/ir-pa-2/main/bsbi.py. A rough sketch of the idea follows.
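For illustration, here is a minimal, self-contained sketch of document-at-a-time WAND over in-memory posting lists. This is not retriv's code: the data layout, the score_fn callback, and the per-term upper bounds are assumptions, and real implementations use skip pointers or block-max bounds instead of the linear cursor advance below.

import heapq

def wand_top_k(query_terms, postings, term_upper_bounds, score_fn, k):
    # postings: term -> sorted list of doc ids (hypothetical layout)
    # term_upper_bounds: term -> max score contribution of that term
    # score_fn(term, doc_id): actual contribution of term to doc_id
    INF = float("inf")
    cursors = {t: 0 for t in query_terms if postings.get(t)}
    heap = []    # min-heap holding the current top-k (score, doc_id)
    theta = 0.0  # score a document must beat to enter the top-k

    def current_doc(t):
        plist, pos = postings[t], cursors[t]
        return plist[pos] if pos < len(plist) else INF

    while True:
        terms = sorted(cursors, key=current_doc)
        if not terms or current_doc(terms[0]) == INF:
            break
        # Pivot: first term at which accumulated upper bounds exceed theta.
        acc, pivot = 0.0, None
        for t in terms:
            acc += term_upper_bounds[t]
            if acc > theta:
                pivot = t
                break
        if pivot is None or current_doc(pivot) == INF:
            break  # no remaining document can beat the threshold
        pivot_doc = current_doc(pivot)
        if current_doc(terms[0]) == pivot_doc:
            # Every cursor up to the pivot sits on pivot_doc: score it fully.
            s = sum(score_fn(t, pivot_doc) for t in terms
                    if current_doc(t) == pivot_doc)
            if len(heap) < k:
                heapq.heappush(heap, (s, pivot_doc))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, pivot_doc))
            if len(heap) == k:
                theta = heap[0][0]
            for t in terms:
                if current_doc(t) == pivot_doc:
                    cursors[t] += 1
        else:
            # Advance the lagging cursor to pivot_doc (real WAND skips here).
            t = terms[0]
            while current_doc(t) < pivot_doc:
                cursors[t] += 1

    return sorted(heap, key=lambda x: -x[0])

postings = {"witches": [1, 3], "masses": [0, 1]}
bounds = {"witches": 2.0, "masses": 1.5}
print(wand_top_k(["witches", "masses"], postings, bounds,
                 lambda t, d: bounds[t] * 0.5, k=2))
# -> [(1.75, 1), (1.0, 3)]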

Multiprocess error triggers while trying example code

Hi AmenRa,

First of all I'd like to thank you for your efforts.
I'm trying to use retriv, but when I use the sample code you provided in the readme, I get the following error:

Building TDF matrix:   0%|          | 0/4 [00:00<?, ?it/s]
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1268, in _count_vocab
    for doc in raw_documents:
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\tqdm\std.py", line 1182, in __iter__
    for obj in iterable:
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multipipe\multipipe.py", line 28, in to_generator
    with Pool(n_threads) as pool:
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\context.py", line 119, in Pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\pool.py", line 329, in _repopulate_pool_static
    w.start()
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\context.py", line 336, in _Popen
    return Popen(process_obj)
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

I get this just by running the example code:

# Note: SearchEngine is an alias for the SparseRetriever
from retriv import SearchEngine

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses"},
  {"id": "doc_2", "text": "Just like witches at black masses"},
  {"id": "doc_3", "text": "Evil minds that plot destruction"},
  {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

se = SearchEngine("new-index").index(collection)

se.search("witches masses")

Could you please help me fix this issue?
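For context, the RuntimeError above prescribes the standard fix: on Windows (and macOS), multiprocessing spawns fresh interpreters, so module-level code re-runs in each worker and the entry point must be guarded. A minimal sketch of the same readme example wrapped accordingly:

from retriv import SearchEngine

collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
    {"id": "doc_3", "text": "Evil minds that plot destruction"},
    {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

def main():
    # Indexing and searching happen only in the main process.
    se = SearchEngine("new-index").index(collection)
    print(se.search("witches masses"))

if __name__ == "__main__":
    main()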

ANN_Searcher not dealing with -1 returned by faiss_index.search()

Traceback (most recent call last):
  File "/lib/python3.8/site-packages/retriv/dense_retriever/dense_retriever.py", line 251, in search
    doc_ids = self.map_internal_ids_to_original_ids(doc_ids)
  File "/lib/python3.8/site-packages/retriv/base_retriever.py", line 80, in map_internal_ids_to_original_ids
    return [self.id_mapping[doc_id] for doc_id in doc_ids]
  File "/lib/python3.8/site-packages/retriv/base_retriever.py", line 80, in <listcomp>
    return [self.id_mapping[doc_id] for doc_id in doc_ids]
KeyError: -1

Update "/lib/python3.8/site-packages/retriv/base_retriever.py", line 80 to

    return [self.id_mapping[doc_id] for doc_id in doc_ids if doc_id != -1]

would fix the problem.

Image search

Hi,
Thank you for a nice Elastic/Pinecone replacement 🙂
A small question (or perhaps a feature request): is it possible to use different neural networks for indexing and retrieval?
I mean, with a CLIP model one first computes vectors for the images, and then uses the other half of the same model to encode the text queries.
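For context, the dual-encoder pattern the question describes looks like this outside of retriv; a sketch using sentence-transformers' CLIP wrapper (the model name and image path are examples, and nothing here implies retriv supports separate encoders today):

from PIL import Image
from sentence_transformers import SentenceTransformer

# One CLIP model, two towers: images and texts land in the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

img_emb = model.encode(Image.open("cat.jpg"))  # image encoder
txt_emb = model.encode("a photo of a cat")     # text encoder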

[Feature Request] Add documents to index after initializing?

Hi,

I understand that there are reasons why we only want to do indexing once, since there are corpus-level statistics that need to be calculated.

But is there any way to index a huge batch of documents, then index a few more, assuming they are from the same distribution?

Alex

Cache directory not in home directory?

Dear @AmenRa,
Thanks for releasing this clean retrieval library! I was wondering if it's possible to set a custom cache directory? By default, it seems like the index is stored in ~/.retriv.

Thank you!
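For anyone landing here: recent retriv releases document a way to relocate the cache; a sketch assuming that entry point exists in your installed version (verify against the README):

import retriv

# Assumption: retriv.set_base_path is available in your release; the
# target directory below is just an example.
retriv.set_base_path("/data/retriv-cache")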

[BUG] Segmentation fault (core dumped)

First of all, thank you for this excellent library.

Describe the bug

Building TDF matrix: 100%|██████████| 13905/13905 [00:34<00:00, 408.07it/s]
Building inverted index: 100%|██████████| 148864/148864 [00:10<00:00, 14750.18it/s]
Batch search:   0%|          | 0/13905 [00:00<?, ?it/s]
Segmentation fault      (core dumped)

I am getting a segmentation fault (core dumped) when using bsearch on the Sparse Retriever.

Current environment
  • CUDA:
    - GPU:
    - NVIDIA GeForce RTX 3090
    - available: True
    - version: 12.1

  • Packages:
    - absl-py: 2.0.0
    - accelerate: 0.24.1
    - aiohttp: 3.8.6
    - aiosignal: 1.3.1
    - alembic: 1.12.1
    - antlr4-python3-runtime: 4.9.3
    - appdirs: 1.4.4
    - async-timeout: 4.0.3
    - attrs: 23.1.0
    - autofaiss: 2.15.8
    - beautifulsoup4: 4.12.2
    - bleach: 6.1.0
    - cachetools: 5.3.2
    - cbor: 1.0.0
    - cbor2: 5.5.1
    - certifi: 2023.7.22
    - charset-normalizer: 3.3.2
    - click: 8.1.7
    - colorlog: 6.7.0
    - contourpy: 1.2.0
    - cramjam: 2.7.0
    - cycler: 0.12.1
    - dill: 0.3.7
    - docker-pycreds: 0.4.0
    - embedding-reader: 1.5.1
    - faiss-cpu: 1.7.4
    - fastparquet: 2023.10.1
    - filelock: 3.13.1
    - fire: 0.4.0
    - fonttools: 4.44.0
    - frozenlist: 1.4.0
    - fsspec: 2023.10.0
    - gitdb: 4.0.11
    - gitpython: 3.1.40
    - google-auth: 2.23.4
    - google-auth-oauthlib: 1.1.0
    - greenlet: 3.0.1
    - grpcio: 1.59.2
    - huggingface-hub: 0.17.3
    - hydra-core: 1.3.2
    - idna: 3.4
    - ijson: 3.2.3
    - indxr: 0.1.5
    - inscriptis: 2.3.2
    - ir-datasets: 0.5.5
    - jinja2: 3.1.2
    - joblib: 1.3.2
    - kaggle: 1.5.16
    - keybert: 0.8.3
    - kiwisolver: 1.4.5
    - krovetzstemmer: 0.8
    - lightning-utilities: 0.9.0
    - llvmlite: 0.41.1
    - lxml: 4.9.3
    - lz4: 4.3.2
    - mako: 1.3.0
    - markdown: 3.5.1
    - markdown-it-py: 3.0.0
    - markupsafe: 2.1.3
    - matplotlib: 3.8.1
    - mdurl: 0.1.2
    - mpmath: 1.3.0
    - multidict: 6.0.4
    - multipipe: 0.1.0
    - multiprocess: 0.70.15
    - networkx: 3.2.1
    - nltk: 3.8.1
    - nmslib: 2.1.1
    - numba: 0.58.1
    - numpy: 1.26.1
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 8.9.2.26
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-nccl-cu12: 2.18.1
    - nvidia-nvjitlink-cu12: 12.3.52
    - nvidia-nvtx-cu12: 12.1.105
    - oauthlib: 3.2.2
    - omegaconf: 2.3.0
    - oneliner-utils: 0.1.2
    - optuna: 3.4.0
    - orjson: 3.9.10
    - packaging: 23.2
    - pandas: 1.5.3
    - pillow: 10.1.0
    - pip: 23.3.1
    - protobuf: 4.23.4
    - psutil: 5.9.6
    - pyarrow: 12.0.1
    - pyasn1: 0.5.0
    - pyasn1-modules: 0.3.0
    - pyautocorpus: 0.1.12
    - pybind11: 2.6.1
    - pygments: 2.16.1
    - pyparsing: 3.1.1
    - pystemmer: 2.0.1
    - python-dateutil: 2.8.2
    - python-slugify: 8.0.1
    - pytorch-lightning: 2.1.1
    - pytorch-metric-learning: 2.3.0
    - pytz: 2023.3.post1
    - pyyaml: 6.0.1
    - ranx: 0.3.18
    - regex: 2023.10.3
    - requests: 2.31.0
    - requests-oauthlib: 1.3.1
    - retriv: 0.2.3
    - rich: 13.6.0
    - rsa: 4.9
    - safetensors: 0.4.0
    - scikit-learn: 1.3.2
    - scipy: 1.11.3
    - seaborn: 0.13.0
    - sentence-transformers: 2.2.2
    - sentencepiece: 0.1.99
    - sentry-sdk: 1.39.1
    - setproctitle: 1.3.3
    - setuptools: 68.2.2
    - six: 1.16.0
    - smmap: 5.0.1
    - soupsieve: 2.5
    - sqlalchemy: 2.0.23
    - sympy: 1.12
    - tabulate: 0.9.0
    - tensorboard: 2.15.1
    - tensorboard-data-server: 0.7.2
    - termcolor: 2.3.0
    - text-unidecode: 1.3
    - threadpoolctl: 3.2.0
    - tokenizers: 0.14.1
    - torch: 2.1.0
    - torchaudio: 2.1.0
    - torchmetrics: 1.2.0
    - torchvision: 0.16.0
    - tqdm: 4.66.1
    - transformers: 4.35.0
    - trec-car-tools: 2.6
    - triton: 2.1.0
    - typing-extensions: 4.8.0
    - unidecode: 1.3.7
    - unlzw3: 0.2.2
    - urllib3: 2.0.7
    - wandb: 0.16.1
    - warc3-wet: 0.2.3
    - warc3-wet-clueweb09: 0.2.5
    - webencodings: 0.5.1
    - werkzeug: 3.0.1
    - wheel: 0.41.2
    - yarl: 1.9.2
    - zlib-state: 0.1.6

  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.10.13
    - release: 5.15.0-88-generic
    - version: #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023

Input file format

Hi, I'm pretty new to this. Can you give an example of what an input file in JSONL format looks like?
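For reference, a JSONL collection is one JSON object per line, using the same {"id", "text"} records as the readme examples; a minimal sketch that writes one:

import json

collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
]

# One JSON object per line, nothing else:
with open("collection.jsonl", "w") as f:
    for doc in collection:
        f.write(json.dumps(doc) + "\n")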

Minimal example for Hybrid Search fails

First, I really like this project!

The sparse and dense examples each work with a minimal setup.

The issue is with the hybrid mode.

Here is the code:

from retriv import HybridRetriever

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses"},
  {"id": "doc_2", "text": "Just like witches at black masses"},
  {"id": "doc_3", "text": "Evil minds that plot destruction"},
  {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

hr = HybridRetriever(
    # Shared params ------------------------------------------------------------
    index_name="hybrid-index",
    # Sparse retriever params --------------------------------------------------
    sr_model="bm25",
    min_df=1,
    tokenizer="whitespace",
    stemmer="english",
    stopwords="english",
    do_lowercasing=True,
    do_ampersand_normalization=True,
    do_special_chars_normalization=True,
    do_acronyms_normalization=True,
    do_punctuation_removal=True,
    # Dense retriever params ---------------------------------------------------
    dr_model="sentence-transformers/multi-qa-MiniLM-L6-dot-v1",
    normalize=True,
    max_length=128,
    use_ann=True,
)

he = hr.index(collection)
he.search(
    query="witches",     # What to search for
    return_docs=True,    # Default value; return the text of the documents
    cutoff=5,            # Default is 100; number of results to return
)

Error:

Building TDF matrix: 100%|██████████| 4/4 [00:01<00:00,  3.41it/s]
Building inverted index: 100%|██████████| 13/13 [00:00<00:00, 6786.90it/s]
Embedding documents: 100%|██████████| 4/4 [00:00<00:00, 206.63it/s]
Building ANN Searcher
100%|██████████| 1/1 [00:00<00:00, 20661.60it/s]
100%|██████████| 1/1 [00:00<00:00, 99.58it/s]
  0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/tmp/ipykernel_45461/1793453458.py", line 32, in <module>
    ([Errno 2] No such file or directory: '/tmp/ipykernel_45461/1793453458.py')
  File "/home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/hybrid_retriever.py", line 255, in search
    dense_results = self.dense_retriever.search(query, False, 1_000)
  File "/home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/dense_retriever/dense_retriever.py", line 251, in search
    doc_ids = self.map_internal_ids_to_original_ids(doc_ids)
  File "/home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/base_retriever.py", line 87, in map_internal_ids_to_original_ids
    return [self.id_mapping[doc_id] for doc_id in doc_ids]
  File "/home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/base_retriever.py", line 87, in <listcomp>
    return [self.id_mapping[doc_id] for doc_id in doc_ids]
KeyError: -1

Qrels and Run query ids do not match

Trial 0 failed with parameters: {'b': 0.37, 'k1': 9.600000000000001} because of the following error:
AssertionError('Qrels and Run query ids do not match').

I guess this happens because the qrels do not contain judgements for every query present in the run.
Wouldn't it be interesting to filter the run dictionary down to only the queries that occur in the qrels?
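As a sketch of that filtering idea, assuming the run and qrels are plain dicts mapping query ids to {doc_id: score} inner dicts (the convention used by ranx, which retriv relies on for evaluation):

qrels = {"q_1": {"doc_2": 1}}
run = {
    "q_1": {"doc_2": 0.9, "doc_1": 0.4},
    "q_2": {"doc_3": 0.7},  # no judgements for q_2
}

# Keep only the queries that actually have relevance judgements.
filtered_run = {q_id: ranking for q_id, ranking in run.items() if q_id in qrels}
# -> only q_1 survives, so Qrels and Run query ids now match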

HybridRetriever does not respect cutoff when calling sub-retrievers and the merger

In HybridRetriever.search:

        sparse_results = self.sparse_retriever.search(query, False, 1_000)
        dense_results = self.dense_retriever.search(query, False, 1_000)
        hybrid_results = self.merger.fuse([sparse_results, dense_results])

cutoff is not passed down.

potential fix:

        sparse_results = self.sparse_retriever.search(query, False, cutoff)
        dense_results = self.dense_retriever.search(query, False, cutoff)
        hybrid_results = self.merger.fuse([sparse_results, dense_results], cutoff)

using another ANN

@AmenRa:

Thanks for the good project.
I suggest using the Qdrant library for in-memory search as an alternative to FAISS.
I can help implement it.

Thanks!

Getting Out of Memory Error

Hi,

I have a dataset with around 2 million rows, and each text is no more than 20 tokens. I tried building an index using the SparseRetriever:

from retriv import SparseRetriever

sr = SparseRetriever(
  index_name="bm25",
  model="bm25",
  min_df=1,
  tokenizer="whitespace",
  stemmer="english",
  stopwords="english",
  do_lowercasing=True,
  do_ampersand_normalization=True,
  do_special_chars_normalization=True,
  do_acronyms_normalization=True,
  do_punctuation_removal=False,
)
collections = [{"id": id, "text": text} for id, text in zip(ids, descs)]
sr.index(collections)

My disk space is around 14GB and my RAM is around 96GB with 24 processors. Is there any option to chunk the data and index it one chunk at a time?
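One thing worth trying, as a sketch: retriv's README documents an index_file entry point for file-based collections, so writing the rows to JSONL and streaming them from disk avoids materializing the whole list in Python memory (whether this alone avoids the OOM is untested here; ids and descs are the names from the snippet above):

import json

# Write the collection to disk once.
with open("collection.jsonl", "w") as f:
    for id_, text in zip(ids, descs):
        f.write(json.dumps({"id": id_, "text": text}) + "\n")

# Assumption: this matches the index_file signature of your retriv version.
sr.index_file("collection.jsonl")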

[Feature Request] Ability to search and index documents with other metadata

Hi,
Nice choice with War Pigs in the example. :)

I've been looking for a pure-Python search engine ever since Whoosh stopped being actively developed.

I realize this library is just getting started, but I was wondering if it would be possible to add the ability to search and filter by metadata as well.

For example

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses", "album": "War pigs"},
  {"id": "doc_2", "text": "Finished with my woman", "album": "Paranoid"}
]

For instance, I might want to search for all lines where the album contains the word "pigs"; a client-side workaround sketch follows.
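Until something like that exists, one workaround is to keep the full records in a dict keyed by id and filter retriv's results by metadata after searching (a sketch; it assumes results carry the document ids, as in the readme output):

docs = {d["id"]: d for d in collection}

results = se.search("masses")
pig_hits = [
    r for r in results
    if "pigs" in docs[r["id"]].get("album", "").lower()
]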

Also, is the search OR by default, i.e. does it find ANY of the words in the query? Can we search with AND and other Boolean operators, as well as with proximity and phrase search? Lucene has these features.

Any plans to combine kNN search with the text search?

[Feature Request] Allow GPU for query embedding

Hi,

Really great and useful library. Thanks for making it available for everyone.

I am mostly using this for quick evaluation of search models, and I realized that the DenseRetriever only uses the GPU for the documents when building the index, not for the queries when running search, which makes it a bit slow for larger sets of queries.

Would you consider adding a use_gpu keyword argument to the search, msearch, and bsearch methods of DenseRetriever and HybridRetriever? It looks like it could be handled similarly to the index method.

Just in case someone else is having the same issue, it can be avoided by setting the encoder device directly before running search, as follows:

use_gpu = True

# Indexing already supports the GPU via the use_gpu argument
dr = dr.index(collection, use_gpu=use_gpu)

# Move the query encoder to the GPU before searching...
if use_gpu:
    dr.encoder.change_device('cuda')

r = dr.bsearch(queries=queries)
dr.encoder.change_device('cpu')  # ...and back to the CPU afterwards

Thanks!

autotune Function Usage Example

I am looking for an example of how to structure the queries and qrels parameters of the autotune function, because I searched the repo and didn't find one. Precisely, what should the keys and values of the queries dict be? And similarly for the qrels dict?

Thanks in advance for your help.
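For what it's worth, here is a guess at the expected shapes, assuming retriv follows the conventions of ranx, which it uses for evaluation; the key names are illustrative and worth verifying against the source:

# Hypothetical shapes, not verified against retriv's autotune implementation.
queries = [
    {"id": "q_1", "text": "witches masses"},
    {"id": "q_2", "text": "evil minds"},
]

# qrels: query id -> {doc id: relevance grade}
qrels = {
    "q_1": {"doc_2": 1},
    "q_2": {"doc_3": 1},
}

se.autotune(queries=queries, qrels=qrels)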

Doc strings

Hi, I can't help but notice that this codebase is largely missing docstrings.

This really hinders my experience as someone trying to use it; in particular, auto-completion does not work.
I know you've written your documentation in markdown files, and why not, but it does not cover every function I may want to use.

Is there an ETA on getting those?

Pre-computed TF-IDF

Is it possible to pass a pre-computed TF-IDF matrix to the Sparse Retriever?

[BUG] Corrupted log when using SearchEngine

Hi again,

I'm stuck with a strange behavior that, from my tests, seems to be related to the use of the SearchEngine.
I'm using a SingletonLogger that logs everything to stdout and persists the log to a file.
When the program runs, the index takes a bit of time to be calculated, and if I check the logfile, I can correctly see everything printed up to that point. After the SearchEngine finishes calculating the index, the first row of the logfile becomes a series of nul values.
Below is a sample of the code and of the log file.
Can anyone give me pointers to solve this?

_logger = Logger()

[code doing stuff, collecting collection mainly]

_logger.info("Building index...")
SearchEngine("new-index").index(collection, show_progress=False)
_logger.info("Index built.")

logfile.log

HybridRetriever raise KeyError: -1 if the len of doc less than 1_000

The cutoff of msearch for HybridRetriever is hardcoded to 1_000, which makes map_internal_ids_to_original_ids raise a KeyError when the collection holds fewer than 1_000 documents:

sparse_results = self.sparse_retriever.search(query, False, 1_000)
dense_results = self.dense_retriever.search(query, False, 1_000)

Thus, map_internal_ids_to_original_ids should be:

def map_internal_ids_to_original_ids(self, doc_ids: Iterable) -> List[str]:
    return [self.id_mapping[doc_id] for doc_id in doc_ids if doc_id != -1]
