
retriv's People

Contributors

alex2awesome, amenra


retriv's Issues

Error while generating package metadata

× Encountered error while generating package metadata.
╰─> See below for output.

(venv) celso@capri:~$ pip install retriv
Collecting retriv
  Using cached retriv-0.1.4-py3-none-any.whl (20 kB)
Requirement already satisfied: numpy in ./projects/venvs/venv/lib/python3.10/site-packages (from retriv) (1.22.4)
Collecting optuna
  Using cached optuna-3.0.5-py3-none-any.whl (348 kB)
Collecting indxr
  Using cached indxr-0.1.1-py3-none-any.whl (8.7 kB)
Collecting cyhunspell
  Using cached CyHunspell-1.3.4.tar.gz (2.7 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [27 lines of output]
      Downloading https://github.com/hunspell/hunspell/archive/v1.6.2.tar.gz to /tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/external/v1.6.2.tar.gz
      Extracting /tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/external/v1.6.2.tar.gz to /tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/external
      Traceback (most recent call last):
        File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/find_library.py", line 226, in pkgconfig
          raise RuntimeError(response)
      RuntimeError: /bin/sh: 1: pkg-config: not found
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/setup.py", line 46, in <module>
          hunspell_config = pkgconfig('hunspell', language='c++')
        File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/find_library.py", line 259, in pkgconfig
          lib_path = build_hunspell_package(os.path.join(BASE_DIR, 'external', 'hunspell-1.6.2'))
        File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/find_library.py", line 189, in build_hunspell_package
          check_call(['autoreconf', '-vfi'])
        File "/usr/lib/python3.10/subprocess.py", line 364, in check_call
          retcode = call(*popenargs, **kwargs)
        File "/usr/lib/python3.10/subprocess.py", line 345, in call
          with Popen(*popenargs, **kwargs) as p:
        File "/usr/lib/python3.10/subprocess.py", line 969, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "/usr/lib/python3.10/subprocess.py", line 1845, in _execute_child
          raise child_exception_type(errno_num, err_msg, err_filename)
      FileNotFoundError: [Errno 2] No such file or directory: 'autoreconf'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
(venv) celso@capri:~$ 

fsspec==2023.12.2 does not allow '**' in path

  File "/lib/python3.8/site-packages/retriv/dense_retriever/dense_retriever.py", line 175, in index
    self.index_aux(
  File "/lib/python3.8/site-packages/retriv/dense_retriever/dense_retriever.py", line 137, in index_aux
    self.ann_searcher.build()
  File "/lib/python3.8/site-packages/retriv/dense_retriever/ann_searcher.py", line 27, in build
    index, index_infos = build_index(
  File "/lib/python3.8/site-packages/autofaiss/external/quantize.py", line 205, in build_index
    embedding_reader = EmbeddingReader(
  File "/lib/python3.8/site-packages/embedding_reader/embedding_reader.py", line 20, in __init__
    self.reader = NumpyReader(embeddings_folder)
  File "/lib/python3.8/site-packages/embedding_reader/numpy_reader.py", line 67, in __init__
    self.fs, embeddings_file_paths = get_file_list(embeddings_folder, "npy")
  File "/lib/python3.8/site-packages/embedding_reader/get_file_list.py", line 15, in get_file_list
    return _get_file_list(path, file_format)
  File "/lib/python3.8/site-packages/embedding_reader/get_file_list.py", line 46, in _get_file_list
    file_paths = fs.glob(glob_pattern)
  File "/lib/python3.8/site-packages/fsspec/spec.py", line 606, in glob
    pattern = glob_translate(path + ("/" if ends_with_sep else ""))
  File "/lib/python3.8/site-packages/fsspec/utils.py", line 734, in glob_translate
    raise ValueError(
ValueError: Invalid pattern: '**' can only be an entire path component

This error does not occur with fsspec==2023.5.0, so pinning fsspec to that version works around the problem.

[Feature Request] Use WAND Top-K Retrieval

@inproceedings{petri2013exploring,
  title={Exploring the magic of WAND},
  author={Petri, Matthias and Culpepper, J Shane and Moffat, Alistair},
  booktitle={Proceedings of the 18th Australasian Document Computing Symposium},
  pages={58--65},
  year={2013}
}

I believe that if you're using an inverted index with per-token document lists, the WAND top-k retrieval algorithm can speed up retrieval for small k over large collections. I'm not sure whether it's relevant to this project. I once implemented it here: https://raw.githubusercontent.com/hockyy/ir-pa-2/main/bsbi.py. A rough sketch of the idea follows.
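For illustration, here is a minimal, self-contained sketch of document-at-a-time WAND over in-memory posting lists. This is not retriv's code: the data layout, the score_fn callback, and the per-term upper bounds are assumptions, and real implementations use skip pointers or block-max bounds instead of the linear cursor advance below.

import heapq

def wand_top_k(query_terms, postings, term_upper_bounds, score_fn, k):
    # postings: term -> sorted list of doc ids (hypothetical layout)
    # term_upper_bounds: term -> max score contribution of that term
    # score_fn(term, doc_id): actual contribution of term to doc_id
    INF = float("inf")
    cursors = {t: 0 for t in query_terms if postings.get(t)}
    heap = []    # min-heap holding the current top-k (score, doc_id)
    theta = 0.0  # score a document must beat to enter the top-k

    def current_doc(t):
        plist, pos = postings[t], cursors[t]
        return plist[pos] if pos < len(plist) else INF

    while True:
        terms = sorted(cursors, key=current_doc)
        if not terms or current_doc(terms[0]) == INF:
            break
        # Pivot: first term at which accumulated upper bounds exceed theta.
        acc, pivot = 0.0, None
        for t in terms:
            acc += term_upper_bounds[t]
            if acc > theta:
                pivot = t
                break
        if pivot is None or current_doc(pivot) == INF:
            break  # no remaining document can beat the threshold
        pivot_doc = current_doc(pivot)
        if current_doc(terms[0]) == pivot_doc:
            # Every cursor up to the pivot sits on pivot_doc: score it fully.
            s = sum(score_fn(t, pivot_doc) for t in terms
                    if current_doc(t) == pivot_doc)
            if len(heap) < k:
                heapq.heappush(heap, (s, pivot_doc))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, pivot_doc))
            if len(heap) == k:
                theta = heap[0][0]
            for t in terms:
                if current_doc(t) == pivot_doc:
                    cursors[t] += 1
        else:
            # Advance the lagging cursor to pivot_doc (real WAND skips here).
            t = terms[0]
            while current_doc(t) < pivot_doc:
                cursors[t] += 1

    return sorted(heap, key=lambda x: -x[0])

postings = {"witches": [1, 3], "masses": [0, 1]}
bounds = {"witches": 2.0, "masses": 1.5}
print(wand_top_k(["witches", "masses"], postings, bounds,
                 lambda t, d: bounds[t] * 0.5, k=2))
# -> [(1.75, 1), (1.0, 3)]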

Multiprocess error triggers while trying example code

Hi AmenRa,

First of all I'd like to thank you for your efforts.
I'm trying to use retriv, but when I use the sample code you provided in the readme, I get the following error:

Building TDF matrix:   0%|          | 0/4 [00:00<?, ?it/s]
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1268, in _count_vocab
    for doc in raw_documents:
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\tqdm\std.py", line 1182, in __iter__
    for obj in iterable:
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multipipe\multipipe.py", line 28, in to_generator
    with Pool(n_threads) as pool:
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\context.py", line 119, in Pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\pool.py", line 329, in _repopulate_pool_static
    w.start()
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\context.py", line 336, in _Popen
    return Popen(process_obj)
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

I get this just by running the example code:

# Note: SearchEngine is an alias for the SparseRetriever
from retriv import SearchEngine

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses"},
  {"id": "doc_2", "text": "Just like witches at black masses"},
  {"id": "doc_3", "text": "Evil minds that plot destruction"},
  {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

se = SearchEngine("new-index").index(collection)

se.search("witches masses")

Could you please help me fix this issue?
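For context, the RuntimeError above prescribes the standard fix: on Windows (and macOS), multiprocessing spawns fresh interpreters, so module-level code re-runs in each worker and the entry point must be guarded. A minimal sketch of the same readme example wrapped accordingly:

from retriv import SearchEngine

collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
    {"id": "doc_3", "text": "Evil minds that plot destruction"},
    {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

def main():
    # Indexing and searching happen only in the main process.
    se = SearchEngine("new-index").index(collection)
    print(se.search("witches masses"))

if __name__ == "__main__":
    main()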

ANN_Searcher not dealing with -1 returned by faiss_index.search()

Traceback (most recent call last):
  File "/lib/python3.8/site-packages/retriv/dense_retriever/dense_retriever.py", line 251, in search
    doc_ids = self.map_internal_ids_to_original_ids(doc_ids)
  File "/lib/python3.8/site-packages/retriv/base_retriever.py", line 80, in map_internal_ids_to_original_ids
    return [self.id_mapping[doc_id] for doc_id in doc_ids]
  File "/lib/python3.8/site-packages/retriv/base_retriever.py", line 80, in <listcomp>
    return [self.id_mapping[doc_id] for doc_id in doc_ids]
KeyError: -1

Update "/lib/python3.8/site-packages/retriv/base_retriever.py", line 80 to

    return [self.id_mapping[doc_id] for doc_id in doc_ids if doc_id != -1]

would fix the problem.

Image search

Hi,
Thank you for a nice Elastic/Pinecone replacement 🙂
A small question (or perhaps a feature request): is it possible to use different neural networks for indexing and retrieval?
I mean, with a CLIP model one first computes vectors for the images, and then uses the other half of the same model to encode the text queries.
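For context, the dual-encoder pattern the question describes looks like this outside of retriv; a sketch using sentence-transformers' CLIP wrapper (the model name and image path are examples, and nothing here implies retriv supports separate encoders today):

from PIL import Image
from sentence_transformers import SentenceTransformer

# One CLIP model, two towers: images and texts land in the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

img_emb = model.encode(Image.open("cat.jpg"))  # image encoder
txt_emb = model.encode("a photo of a cat")     # text encoder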

[Feature Request] Add documents to index after initializing?

Hi,

I understand that there are reasons why we only want to do indexing once, since there are corpus-level statistics that need to be calculated.

But is there any way to index a huge batch of documents, then index a few more, assuming they are from the same distribution?

Alex

Cache directory not in home directory?

Dear @AmenRa,
Thanks for releasing this clean retrieval library! I was wondering if it's possible to set a custom cache directory? By default, it seems like the index is stored in ~/.retriv.

Thank you!
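For anyone landing here: recent retriv releases document a way to relocate the cache; a sketch assuming that entry point exists in your installed version (verify against the README):

import retriv

# Assumption: retriv.set_base_path is available in your release; the
# target directory below is just an example.
retriv.set_base_path("/data/retriv-cache")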

[BUG] Segmentation fault (core dumped)

First of all, thank you for this excellent library.

Describe the bug

Building TDF matrix: 100%|██████████| 13905/13905 [00:34<00:00, 408.07it/s]
Building inverted index: 100%|██████████| 148864/148864 [00:10<00:00, 14750.18it/s]
Batch search:   0%|          | 0/13905 [00:00<?, ?it/s]
Segmentation fault      (core dumped)

I am getting a segmentation fault (core dumped) when using bsearch on the Sparse Retriever.

Current environment
  • CUDA:
    - GPU:
    - NVIDIA GeForce RTX 3090
    - available: True
    - version: 12.1

  • Packages:
    - absl-py: 2.0.0
    - accelerate: 0.24.1
    - aiohttp: 3.8.6
    - aiosignal: 1.3.1
    - alembic: 1.12.1
    - antlr4-python3-runtime: 4.9.3
    - appdirs: 1.4.4
    - async-timeout: 4.0.3
    - attrs: 23.1.0
    - autofaiss: 2.15.8
    - beautifulsoup4: 4.12.2
    - bleach: 6.1.0
    - cachetools: 5.3.2
    - cbor: 1.0.0
    - cbor2: 5.5.1
    - certifi: 2023.7.22
    - charset-normalizer: 3.3.2
    - click: 8.1.7
    - colorlog: 6.7.0
    - contourpy: 1.2.0
    - cramjam: 2.7.0
    - cycler: 0.12.1
    - dill: 0.3.7
    - docker-pycreds: 0.4.0
    - embedding-reader: 1.5.1
    - faiss-cpu: 1.7.4
    - fastparquet: 2023.10.1
    - filelock: 3.13.1
    - fire: 0.4.0
    - fonttools: 4.44.0
    - frozenlist: 1.4.0
    - fsspec: 2023.10.0
    - gitdb: 4.0.11
    - gitpython: 3.1.40
    - google-auth: 2.23.4
    - google-auth-oauthlib: 1.1.0
    - greenlet: 3.0.1
    - grpcio: 1.59.2
    - huggingface-hub: 0.17.3
    - hydra-core: 1.3.2
    - idna: 3.4
    - ijson: 3.2.3
    - indxr: 0.1.5
    - inscriptis: 2.3.2
    - ir-datasets: 0.5.5
    - jinja2: 3.1.2
    - joblib: 1.3.2
    - kaggle: 1.5.16
    - keybert: 0.8.3
    - kiwisolver: 1.4.5
    - krovetzstemmer: 0.8
    - lightning-utilities: 0.9.0
    - llvmlite: 0.41.1
    - lxml: 4.9.3
    - lz4: 4.3.2
    - mako: 1.3.0
    - markdown: 3.5.1
    - markdown-it-py: 3.0.0
    - markupsafe: 2.1.3
    - matplotlib: 3.8.1
    - mdurl: 0.1.2
    - mpmath: 1.3.0
    - multidict: 6.0.4
    - multipipe: 0.1.0
    - multiprocess: 0.70.15
    - networkx: 3.2.1
    - nltk: 3.8.1
    - nmslib: 2.1.1
    - numba: 0.58.1
    - numpy: 1.26.1
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 8.9.2.26
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-nccl-cu12: 2.18.1
    - nvidia-nvjitlink-cu12: 12.3.52
    - nvidia-nvtx-cu12: 12.1.105
    - oauthlib: 3.2.2
    - omegaconf: 2.3.0
    - oneliner-utils: 0.1.2
    - optuna: 3.4.0
    - orjson: 3.9.10
    - packaging: 23.2
    - pandas: 1.5.3
    - pillow: 10.1.0
    - pip: 23.3.1
    - protobuf: 4.23.4
    - psutil: 5.9.6
    - pyarrow: 12.0.1
    - pyasn1: 0.5.0
    - pyasn1-modules: 0.3.0
    - pyautocorpus: 0.1.12
    - pybind11: 2.6.1
    - pygments: 2.16.1
    - pyparsing: 3.1.1
    - pystemmer: 2.0.1
    - python-dateutil: 2.8.2
    - python-slugify: 8.0.1
    - pytorch-lightning: 2.1.1
    - pytorch-metric-learning: 2.3.0
    - pytz: 2023.3.post1
    - pyyaml: 6.0.1
    - ranx: 0.3.18
    - regex: 2023.10.3
    - requests: 2.31.0
    - requests-oauthlib: 1.3.1
    - retriv: 0.2.3
    - rich: 13.6.0
    - rsa: 4.9
    - safetensors: 0.4.0
    - scikit-learn: 1.3.2
    - scipy: 1.11.3
    - seaborn: 0.13.0
    - sentence-transformers: 2.2.2
    - sentencepiece: 0.1.99
    - sentry-sdk: 1.39.1
    - setproctitle: 1.3.3
    - setuptools: 68.2.2
    - six: 1.16.0
    - smmap: 5.0.1
    - soupsieve: 2.5
    - sqlalchemy: 2.0.23
    - sympy: 1.12
    - tabulate: 0.9.0
    - tensorboard: 2.15.1
    - tensorboard-data-server: 0.7.2
    - termcolor: 2.3.0
    - text-unidecode: 1.3
    - threadpoolctl: 3.2.0
    - tokenizers: 0.14.1
    - torch: 2.1.0
    - torchaudio: 2.1.0
    - torchmetrics: 1.2.0
    - torchvision: 0.16.0
    - tqdm: 4.66.1
    - transformers: 4.35.0
    - trec-car-tools: 2.6
    - triton: 2.1.0
    - typing-extensions: 4.8.0
    - unidecode: 1.3.7
    - unlzw3: 0.2.2
    - urllib3: 2.0.7
    - wandb: 0.16.1
    - warc3-wet: 0.2.3
    - warc3-wet-clueweb09: 0.2.5
    - webencodings: 0.5.1
    - werkzeug: 3.0.1
    - wheel: 0.41.2
    - yarl: 1.9.2
    - zlib-state: 0.1.6

  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.10.13
    - release: 5.15.0-88-generic
    - version: #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023

Input file format

Hi, I'm pretty new to this. Can you give an example of what an input file in JSONL format looks like?
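For reference, a JSONL collection is one JSON object per line, using the same {"id", "text"} records as the readme examples; a minimal sketch that writes one:

import json

collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
]

# One JSON object per line, nothing else:
with open("collection.jsonl", "w") as f:
    for doc in collection:
        f.write(json.dumps(doc) + "\n")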

Minimal example for Hybrid Search fails

First, I really like this project!

The sparse and dense examples each work with a minimal setup.

The issue is with the hybrid mode.

Here is the code:

from retriv import HybridRetriever

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses"},
  {"id": "doc_2", "text": "Just like witches at black masses"},
  {"id": "doc_3", "text": "Evil minds that plot destruction"},
  {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

hr = HybridRetriever(
    # Shared params ------------------------------------------------------------
    index_name="hybrid-index",
    # Sparse retriever params --------------------------------------------------
    sr_model="bm25",
    min_df=1,
    tokenizer="whitespace",
    stemmer="english",
    stopwords="english",
    do_lowercasing=True,
    do_ampersand_normalization=True,
    do_special_chars_normalization=True,
    do_acronyms_normalization=True,
    do_punctuation_removal=True,
    # Dense retriever params ---------------------------------------------------
    dr_model="sentence-transformers/multi-qa-MiniLM-L6-dot-v1",
    normalize=True,
    max_length=128,
    use_ann=True,
)

he = hr.index(collection)
he.search(
    query="witches",     # What to search for
    return_docs=True,    # Default value; return the text of the documents
    cutoff=5,            # Default is 100; number of results to return
)

Error:

Building TDF matrix: 100%|██████████| 4/4 [00:01<00:00,  3.41it/s]
Building inverted index: 100%|██████████| 13/13 [00:00<00:00, 6786.90it/s]
Embedding documents: 100%|██████████| 4/4 [00:00<00:00, 206.63it/s]
Building ANN Searcher
100%|██████████| 1/1 [00:00<00:00, 20661.60it/s]
100%|██████████| 1/1 [00:00<00:00, 99.58it/s]
  0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/tmp/ipykernel_45461/1793453458.py", line 32, in <module>
    ([Errno 2] No such file or directory: '/tmp/ipykernel_45461/1793453458.py')
  File "/home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/hybrid_retriever.py", line 255, in search
    dense_results = self.dense_retriever.search(query, False, 1_000)
  File "/home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/dense_retriever/dense_retriever.py", line 251, in search
    doc_ids = self.map_internal_ids_to_original_ids(doc_ids)
  File "/home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/base_retriever.py", line 87, in map_internal_ids_to_original_ids
    return [self.id_mapping[doc_id] for doc_id in doc_ids]
  File "/home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/base_retriever.py", line 87, in <listcomp>
    return [self.id_mapping[doc_id] for doc_id in doc_ids]
KeyError: -1

Qrels and Run query ids do not match

Trial 0 failed with parameters: {'b': 0.37, 'k1': 9.600000000000001} because of the following error:
AssertionError('Qrels and Run query ids do not match').

I guess this happens because the qrels do not contain judgements for every query present in the run.
Wouldn't it be interesting to filter the run dictionary down to only the queries that occur in the qrels?
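As a sketch of that filtering idea, assuming the run and qrels are plain dicts mapping query ids to {doc_id: score} inner dicts (the convention used by ranx, which retriv relies on for evaluation):

qrels = {"q_1": {"doc_2": 1}}
run = {
    "q_1": {"doc_2": 0.9, "doc_1": 0.4},
    "q_2": {"doc_3": 0.7},  # no judgements for q_2
}

# Keep only the queries that actually have relevance judgements.
filtered_run = {q_id: ranking for q_id, ranking in run.items() if q_id in qrels}
# -> only q_1 survives, so Qrels and Run query ids now match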

HybridRetriever does not respect cutoff when calling sub-retrievers and the merger

In HybridRetriever.search:

        sparse_results = self.sparse_retriever.search(query, False, 1_000)
        dense_results = self.dense_retriever.search(query, False, 1_000)
        hybrid_results = self.merger.fuse([sparse_results, dense_results])

cutoff is not passed down.

potential fix:

        sparse_results = self.sparse_retriever.search(query, False, cutoff)
        dense_results = self.dense_retriever.search(query, False, cutoff)
        hybrid_results = self.merger.fuse([sparse_results, dense_results], cutoff)

using another ANN

@AmenRa:

Thanks for the good project.
I suggest using the Qdrant library for in-memory search as an alternative to FAISS.
I can help implement it.

Thanks!

Getting Out of Memory Error

Hi,

I have a dataset with around 2 million rows, and each text is no more than 20 tokens. I tried building an index using the SparseRetriever:

from retriv import SparseRetriever

sr = SparseRetriever(
  index_name="bm25",
  model="bm25",
  min_df=1,
  tokenizer="whitespace",
  stemmer="english",
  stopwords="english",
  do_lowercasing=True,
  do_ampersand_normalization=True,
  do_special_chars_normalization=True,
  do_acronyms_normalization=True,
  do_punctuation_removal=False,
)
collections = [{"id": id, "text": text} for id, text in zip(ids, descs)]
sr.index(collections)

My disk space is around 14GB and my RAM is around 96GB with 24 processors. Is there any option to chunk the data and index it one chunk at a time?
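One thing worth trying, as a sketch: retriv's README documents an index_file entry point for file-based collections, so writing the rows to JSONL and streaming them from disk avoids materializing the whole list in Python memory (whether this alone avoids the OOM is untested here; ids and descs are the names from the snippet above):

import json

# Write the collection to disk once.
with open("collection.jsonl", "w") as f:
    for id_, text in zip(ids, descs):
        f.write(json.dumps({"id": id_, "text": text}) + "\n")

# Assumption: this matches the index_file signature of your retriv version.
sr.index_file("collection.jsonl")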

[Feature Request] Ability to search and index documents with other metadata

Hi,
Nice choice with War Pigs in the example. :)

I've been looking for a pure-Python search engine ever since Whoosh stopped being actively developed.

I realize this library is just getting started, but I was wondering if it would be possible to add the ability to search and filter by metadata as well.

For example

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses", "album": "War pigs"},
  {"id": "doc_2", "text": "Finished with my woman", "album": "Paranoid"}
]

For instance, I might want to search for all lines where the album contains the word "pigs"; a client-side workaround sketch follows.
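Until something like that exists, one workaround is to keep the full records in a dict keyed by id and filter retriv's results by metadata after searching (a sketch; it assumes results carry the document ids, as in the readme output):

docs = {d["id"]: d for d in collection}

results = se.search("masses")
pig_hits = [
    r for r in results
    if "pigs" in docs[r["id"]].get("album", "").lower()
]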

Also, is the search OR by default, i.e. does it find ANY of the words in the query? Can we search with AND and other Boolean operators, as well as with proximity and phrase search? Lucene has these features.

Any plans to combine kNN search with the text search?

[Feature Request] Allow GPU for query embedding

Hi,

Really great and useful library. Thanks for making it available for everyone.

I am mostly using this for quick evaluation of search models, and I realized that the DenseRetriever only uses the GPU for the documents when building the index, not for the queries when running search, which makes it a bit slow for larger sets of queries.

Would you consider adding a use_gpu keyword argument to the search, msearch, and bsearch methods of DenseRetriever and HybridRetriever? It looks like it could be handled similarly to the index method.

Just in case someone else is having the same issue, it can be avoided by setting the encoder device directly before running search, as follows:

use_gpu = True

# Indexing already supports the GPU via the use_gpu argument
dr = dr.index(collection, use_gpu=use_gpu)

# Move the query encoder to the GPU before searching...
if use_gpu:
    dr.encoder.change_device('cuda')

r = dr.bsearch(queries=queries)
dr.encoder.change_device('cpu')  # ...and back to the CPU afterwards

Thanks!

autotune Function Usage Example

I am looking for an example of how to structure the queries and qrels parameters of the autotune function, because I searched the repo and didn't find one. Precisely, what should the keys and values of the queries dict be? And similarly for the qrels dict?

Thanks in advance for your help.
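For what it's worth, here is a guess at the expected shapes, assuming retriv follows the conventions of ranx, which it uses for evaluation; the key names are illustrative and worth verifying against the source:

# Hypothetical shapes, not verified against retriv's autotune implementation.
queries = [
    {"id": "q_1", "text": "witches masses"},
    {"id": "q_2", "text": "evil minds"},
]

# qrels: query id -> {doc id: relevance grade}
qrels = {
    "q_1": {"doc_2": 1},
    "q_2": {"doc_3": 1},
}

se.autotune(queries=queries, qrels=qrels)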

Doc strings

Hi, I can't help but notice that this codebase is largely missing docstrings.

This really hinders my experience as someone trying to use it; in particular, auto-completion does not work.
I know you've written your documentation in markdown files, and why not, but it does not cover every function I may want to use.

Is there an ETA on getting those?

Pre-computed TF-IDF

Is it possible to pass a pre-computed TF-IDF matrix to the Sparse Retriever?

[BUG] Corrupted log when using SearchEngine

Hi again,

I'm stuck with a strange behavior that, from my tests, seems to be related to the use of the SearchEngine.
I'm using a SingletonLogger that logs everything to stdout and persists the log to a file.
When the program runs, the index takes a bit of time to be calculated, and if I check the logfile, I can correctly see everything printed up to that point. After the SearchEngine finishes calculating the index, the first row of the logfile becomes a series of nul values.
Below is a sample of the code and of the log file.
Can anyone give me pointers to solve this?

_logger = Logger()

[code doing stuff, collecting collection mainly]

_logger.info("Building index...")
SearchEngine("new-index").index(collection, show_progress=False)
_logger.info("Index built.")

logfile.log

HybridRetriever raise KeyError: -1 if the len of doc less than 1_000

The cutoff of msearch for HybridRetriever is hardcoded to 1_000, which makes map_internal_ids_to_original_ids raise a KeyError when the collection holds fewer than 1_000 documents:

sparse_results = self.sparse_retriever.search(query, False, 1_000)
dense_results = self.dense_retriever.search(query, False, 1_000)

Thus, map_internal_ids_to_original_ids should be:

def map_internal_ids_to_original_ids(self, doc_ids: Iterable) -> List[str]:
    return [self.id_mapping[doc_id] for doc_id in doc_ids if doc_id != -1]
