amenra / retriv
A Python Search Engine for Humans 🥸
License: MIT License
× Encountered error while generating package metadata.
╰─> See below output.
(venv) celso@capri:~$ pip install retriv
Collecting retriv
Using cached retriv-0.1.4-py3-none-any.whl (20 kB)
Requirement already satisfied: numpy in ./projects/venvs/venv/lib/python3.10/site-packages (from retriv) (1.22.4)
Collecting optuna
Using cached optuna-3.0.5-py3-none-any.whl (348 kB)
Collecting indxr
Using cached indxr-0.1.1-py3-none-any.whl (8.7 kB)
Collecting cyhunspell
Using cached CyHunspell-1.3.4.tar.gz (2.7 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [27 lines of output]
Downloading https://github.com/hunspell/hunspell/archive/v1.6.2.tar.gz to /tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/external/v1.6.2.tar.gz
Extracting /tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/external/v1.6.2.tar.gz to /tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/external
Traceback (most recent call last):
File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/find_library.py", line 226, in pkgconfig
raise RuntimeError(response)
RuntimeError: /bin/sh: 1: pkg-config: not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/setup.py", line 46, in <module>
hunspell_config = pkgconfig('hunspell', language='c++')
File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/find_library.py", line 259, in pkgconfig
lib_path = build_hunspell_package(os.path.join(BASE_DIR, 'external', 'hunspell-1.6.2'))
File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/find_library.py", line 189, in build_hunspell_package
check_call(['autoreconf', '-vfi'])
File "/usr/lib/python3.10/subprocess.py", line 364, in check_call
retcode = call(*popenargs, **kwargs)
File "/usr/lib/python3.10/subprocess.py", line 345, in call
with Popen(*popenargs, **kwargs) as p:
File "/usr/lib/python3.10/subprocess.py", line 969, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.10/subprocess.py", line 1845, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'autoreconf'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
(venv) celso@capri:~$
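The two failures above (pkg-config: not found and a missing autoreconf) indicate absent system build tools rather than a pip problem. A hedged fix, assuming a Debian/Ubuntu host (package names differ on other distributions):

sudo apt-get install build-essential pkg-config autoconf automake libtool
pip install retriv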
File "/lib/python3.8/site-packages/retriv/dense_retriever/dense_retriever.py", line 175, in index
self.index_aux(
File "/lib/python3.8/site-packages/retriv/dense_retriever/dense_retriever.py", line 137, in index_aux
self.ann_searcher.build()
File "/lib/python3.8/site-packages/retriv/dense_retriever/ann_searcher.py", line 27, in build
index, index_infos = build_index(
File "/lib/python3.8/site-packages/autofaiss/external/quantize.py", line 205, in build_index
embedding_reader = EmbeddingReader(
File "/lib/python3.8/site-packages/embedding_reader/embedding_reader.py", line 20, in __init__
self.reader = NumpyReader(embeddings_folder)
File "/lib/python3.8/site-packages/embedding_reader/numpy_reader.py", line 67, in __init__
self.fs, embeddings_file_paths = get_file_list(embeddings_folder, "npy")
File "/lib/python3.8/site-packages/embedding_reader/get_file_list.py", line 15, in get_file_list
return _get_file_list(path, file_format)
File "/lib/python3.8/site-packages/embedding_reader/get_file_list.py", line 46, in _get_file_list
file_paths = fs.glob(glob_pattern)
File "/lib/python3.8/site-packages/fsspec/spec.py", line 606, in glob
pattern = glob_translate(path + ("/" if ends_with_sep else ""))
File "/lib/python3.8/site-packages/fsspec/utils.py", line 734, in glob_translate
raise ValueError(
ValueError: Invalid pattern: '**' can only be an entire path component
This error does not occur with fsspec==2023.5.0.
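Until that is resolved, a hedged workaround, taken directly from the observation above, is to pin fsspec to the last known-good release:

pip install "fsspec==2023.5.0"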
I have had a good experience using retriv,
but NLTK downloads some files when users first start using the package, which means users developing in offline environments can't use it at all.
I think letting users define their own dictionaries would be more user-friendly.
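For offline environments, the NLTK data can be fetched once on a connected machine and shipped alongside the code. A minimal sketch, assuming the stopwords corpus is among the files retriv triggers and that ./nltk_data is a directory you control:

import nltk

# On a machine with internet access: download once into a local folder.
nltk.download("stopwords", download_dir="./nltk_data")

# On the offline machine: point NLTK at the bundled folder before using retriv.
nltk.data.path.append("./nltk_data")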
@inproceedings{petri2013exploring,
title={Exploring the magic of WAND},
author={Petri, Matthias and Culpepper, J Shane and Moffat, Alistair},
booktitle={Proceedings of the 18th Australasian Document Computing Symposium},
pages={58--65},
year={2013}
}
I believe that if you're using an inverted index with per-token document lists, the WAND top-k retrieval algorithm can speed up retrieval for small k over large collections. I'm not sure whether it's relevant to this project. I once implemented it here: https://raw.githubusercontent.com/hockyy/ir-pa-2/main/bsbi.py
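For context, WAND keeps a per-term upper bound on scores and skips any document whose best-case total cannot beat the current k-th result. A minimal sketch of the idea (not taken from the linked file; the posting-list layout and scoring are simplified assumptions):

import heapq

def wand_topk(postings, max_scores, k):
    """postings: term -> list of (doc_id, score) sorted by doc_id.
    max_scores: term -> upper bound on that term's score in any document."""
    cursors = {t: 0 for t in postings}  # current position in each posting list
    heap = []                           # min-heap of (score, doc_id), size <= k
    threshold = 0.0                     # score of the current k-th best result
    while True:
        active = [t for t in postings if cursors[t] < len(postings[t])]
        if not active:
            break
        active.sort(key=lambda t: postings[t][cursors[t]][0])
        # Pivot: first term where the prefix sum of upper bounds beats the threshold.
        upper, pivot = 0.0, None
        for t in active:
            upper += max_scores[t]
            if upper > threshold:
                pivot = t
                break
        if pivot is None:
            break  # no remaining document can enter the top k
        pivot_doc = postings[pivot][cursors[pivot]][0]
        if postings[active[0]][cursors[active[0]]][0] == pivot_doc:
            # All lists up to the pivot sit on pivot_doc: score it fully.
            score = 0.0
            for t in active:
                doc_id, s = postings[t][cursors[t]]
                if doc_id == pivot_doc:
                    score += s
                    cursors[t] += 1
            if len(heap) < k:
                heapq.heappush(heap, (score, pivot_doc))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, pivot_doc))
            if len(heap) == k:
                threshold = heap[0][0]
        else:
            # Skip the lagging lists forward to pivot_doc without scoring.
            for t in active:
                while cursors[t] < len(postings[t]) and postings[t][cursors[t]][0] < pivot_doc:
                    cursors[t] += 1
    return sorted(heap, reverse=True)

Production implementations replace the linear cursor advance with skip pointers or block-max indexes; the sketch only shows the pruning logic.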
When a corpus contains a very large number of documents, we usually split it into multiple files. I wonder if it would be possible to support passing a sequence of JSON/JSONL files to build the index.
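Until that is supported natively, the shards can be merged in memory; a minimal sketch, assuming index() accepts a list of {"id", "text"} dicts as in the README (the shard file names are hypothetical):

import json
from retriv import SparseRetriever

def read_jsonl_shards(paths):
    # Yield one {"id": ..., "text": ...} record per line across all shard files.
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

sr = SparseRetriever(index_name="sharded-index")
sr.index(list(read_jsonl_shards(["shard_00.jsonl", "shard_01.jsonl"])))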
Hello, I am having a good experience with retriv
:-)
Is there an approach to add progress status when performing msearch or bsearch?
Hi AmenRa,
First of all I'd like to thank you for your efforts.
I'm trying to use retriv, but when I use the sample code you provided in the readme, I get the following error:
Building TDF matrix: 0%| | 0/4 [00:00<?, ?it/s]
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1268, in _count_vocab
for doc in raw_documents:
File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\tqdm\std.py", line 1182, in __iter__
for obj in iterable:
File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multipipe\multipipe.py", line 28, in to_generator
with Pool(n_threads) as pool:
File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\context.py", line 119, in Pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\pool.py", line 329, in _repopulate_pool_static
w.start()
File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\context.py", line 336, in _Popen
return Popen(process_obj)
File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\popen_spawn_win32.py", line 45, in __init__
prep_data = spawn.get_preparation_data(process_obj._name)
File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
return Pool(processes, initializer, initargs, maxtasksperchild,
By just running the example code:
# Note: SearchEngine is an alias for the SparseRetriever
from retriv import SearchEngine
collection = [
{"id": "doc_1", "text": "Generals gathered in their masses"},
{"id": "doc_2", "text": "Just like witches at black masses"},
{"id": "doc_3", "text": "Evil minds that plot destruction"},
{"id": "doc_4", "text": "Sorcerer of death's construction"},
]
se = SearchEngine("new-index").index(collection)
se.search("witches masses")
Could you please help me fix this issue?
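The RuntimeError above states the fix itself: on Windows, multiprocessing spawns fresh interpreters, so the indexing call must be guarded in the main module. A minimal adaptation of the example:

from retriv import SearchEngine

collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
    {"id": "doc_3", "text": "Evil minds that plot destruction"},
    {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

if __name__ == "__main__":
    # The guard keeps spawned worker processes from re-running this code.
    se = SearchEngine("new-index").index(collection)
    print(se.search("witches masses"))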
@fulmicoton created a quite nice Search Benchmark where engines can be benchmarked against a fixed set of queries. It would be nice to integrate retriv into it and see how it performs against optimized ones. Maybe the "search engine for the common man" ends up being quite competitive too.
EDIT: Adding the main repo for reference
Traceback (most recent call last):
File "/lib/python3.8/site-packages/retriv/dense_retriever/dense_retriever.py", line 251, in search
doc_ids = self.map_internal_ids_to_original_ids(doc_ids)
File "/lib/python3.8/site-packages/retriv/base_retriever.py", line 80, in map_internal_ids_to_original_ids
return [self.id_mapping[doc_id] for doc_id in doc_ids]
File "/lib/python3.8/site-packages/retriv/base_retriever.py", line 80, in <listcomp>
return [self.id_mapping[doc_id] for doc_id in doc_ids]
KeyError: -1
Update "/lib/python3.8/site-packages/retriv/base_retriever.py", line 80 to
return [self.id_mapping[doc_id] for doc_id in doc_ids if doc_id != -1]
would fix the problem.
Hi,
Thank you for a nice Elastic/Pinecone replacement!
A small question (or perhaps a feature request): is it possible to use different neural networks for indexing and retrieval?
I mean, with the CLIP model one first computes vectors for the images, and then uses the second part of the same model to encode the text queries.
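Outside retriv, the asymmetric setup looks like this; a minimal sketch, assuming sentence-transformers' clip-ViT-B-32 checkpoint and two local image files (hypothetical names):

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# The vision tower embeds images at indexing time...
img_emb = model.encode([Image.open("cat.jpg"), Image.open("dog.jpg")])

# ...and the text tower embeds queries at search time, into the same space.
txt_emb = model.encode(["a photo of a cat"])

print(util.cos_sim(txt_emb, img_emb))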
Hi,
I understand that there are reasons to do indexing only once, since corpus-level statistics need to be calculated.
But is there any way to index a huge batch of documents and then index a few more later, assuming they come from the same distribution?
Alex
Would it be beneficial to include some TfidfVectorizer args such as ngram_range and max_features?
Dear @AmenRa,
Thanks for releasing the clean retrieval library! I was wondering if it's possible to set custom cache directories? By default, it seems like the index is being stored in ~/.retriv.
Thank you!
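If I remember the docs correctly, recent retriv versions expose a base-path setter for exactly this; treat the sketch below as an assumption to verify against the README:

import retriv

# Hypothetical target directory; retriv would create its indexes under it.
retriv.set_base_path("/data/retriv-cache")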
First of all, thank you for this excellent library.
Describe the bug
Building TDF matrix: 100%|███████████████████████████████████████████| 13905/13905 [00:34<00:00, 408.07it/s]
Building inverted index: 100%|███████████████████████████████████| 148864/148864 [00:10<00:00, 14750.18it/s]
Batch search: 0%| | 0/13905 [00:00<?, ?it/s]
Segmentation fault (core dumped)
I am getting Segmentation fault (core dumped) when using bsearch in the Sparse Retriever.
CUDA:
- GPU:
- NVIDIA GeForce RTX 3090
- available: True
- version: 12.1
Packages:
- absl-py: 2.0.0
- accelerate: 0.24.1
- aiohttp: 3.8.6
- aiosignal: 1.3.1
- alembic: 1.12.1
- antlr4-python3-runtime: 4.9.3
- appdirs: 1.4.4
- async-timeout: 4.0.3
- attrs: 23.1.0
- autofaiss: 2.15.8
- beautifulsoup4: 4.12.2
- bleach: 6.1.0
- cachetools: 5.3.2
- cbor: 1.0.0
- cbor2: 5.5.1
- certifi: 2023.7.22
- charset-normalizer: 3.3.2
- click: 8.1.7
- colorlog: 6.7.0
- contourpy: 1.2.0
- cramjam: 2.7.0
- cycler: 0.12.1
- dill: 0.3.7
- docker-pycreds: 0.4.0
- embedding-reader: 1.5.1
- faiss-cpu: 1.7.4
- fastparquet: 2023.10.1
- filelock: 3.13.1
- fire: 0.4.0
- fonttools: 4.44.0
- frozenlist: 1.4.0
- fsspec: 2023.10.0
- gitdb: 4.0.11
- gitpython: 3.1.40
- google-auth: 2.23.4
- google-auth-oauthlib: 1.1.0
- greenlet: 3.0.1
- grpcio: 1.59.2
- huggingface-hub: 0.17.3
- hydra-core: 1.3.2
- idna: 3.4
- ijson: 3.2.3
- indxr: 0.1.5
- inscriptis: 2.3.2
- ir-datasets: 0.5.5
- jinja2: 3.1.2
- joblib: 1.3.2
- kaggle: 1.5.16
- keybert: 0.8.3
- kiwisolver: 1.4.5
- krovetzstemmer: 0.8
- lightning-utilities: 0.9.0
- llvmlite: 0.41.1
- lxml: 4.9.3
- lz4: 4.3.2
- mako: 1.3.0
- markdown: 3.5.1
- markdown-it-py: 3.0.0
- markupsafe: 2.1.3
- matplotlib: 3.8.1
- mdurl: 0.1.2
- mpmath: 1.3.0
- multidict: 6.0.4
- multipipe: 0.1.0
- multiprocess: 0.70.15
- networkx: 3.2.1
- nltk: 3.8.1
- nmslib: 2.1.1
- numba: 0.58.1
- numpy: 1.26.1
- nvidia-cublas-cu12: 12.1.3.1
- nvidia-cuda-cupti-cu12: 12.1.105
- nvidia-cuda-nvrtc-cu12: 12.1.105
- nvidia-cuda-runtime-cu12: 12.1.105
- nvidia-cudnn-cu12: 8.9.2.26
- nvidia-cufft-cu12: 11.0.2.54
- nvidia-curand-cu12: 10.3.2.106
- nvidia-cusolver-cu12: 11.4.5.107
- nvidia-cusparse-cu12: 12.1.0.106
- nvidia-nccl-cu12: 2.18.1
- nvidia-nvjitlink-cu12: 12.3.52
- nvidia-nvtx-cu12: 12.1.105
- oauthlib: 3.2.2
- omegaconf: 2.3.0
- oneliner-utils: 0.1.2
- optuna: 3.4.0
- orjson: 3.9.10
- packaging: 23.2
- pandas: 1.5.3
- pillow: 10.1.0
- pip: 23.3.1
- protobuf: 4.23.4
- psutil: 5.9.6
- pyarrow: 12.0.1
- pyasn1: 0.5.0
- pyasn1-modules: 0.3.0
- pyautocorpus: 0.1.12
- pybind11: 2.6.1
- pygments: 2.16.1
- pyparsing: 3.1.1
- pystemmer: 2.0.1
- python-dateutil: 2.8.2
- python-slugify: 8.0.1
- pytorch-lightning: 2.1.1
- pytorch-metric-learning: 2.3.0
- pytz: 2023.3.post1
- pyyaml: 6.0.1
- ranx: 0.3.18
- regex: 2023.10.3
- requests: 2.31.0
- requests-oauthlib: 1.3.1
- retriv: 0.2.3
- rich: 13.6.0
- rsa: 4.9
- safetensors: 0.4.0
- scikit-learn: 1.3.2
- scipy: 1.11.3
- seaborn: 0.13.0
- sentence-transformers: 2.2.2
- sentencepiece: 0.1.99
- sentry-sdk: 1.39.1
- setproctitle: 1.3.3
- setuptools: 68.2.2
- six: 1.16.0
- smmap: 5.0.1
- soupsieve: 2.5
- sqlalchemy: 2.0.23
- sympy: 1.12
- tabulate: 0.9.0
- tensorboard: 2.15.1
- tensorboard-data-server: 0.7.2
- termcolor: 2.3.0
- text-unidecode: 1.3
- threadpoolctl: 3.2.0
- tokenizers: 0.14.1
- torch: 2.1.0
- torchaudio: 2.1.0
- torchmetrics: 1.2.0
- torchvision: 0.16.0
- tqdm: 4.66.1
- transformers: 4.35.0
- trec-car-tools: 2.6
- triton: 2.1.0
- typing-extensions: 4.8.0
- unidecode: 1.3.7
- unlzw3: 0.2.2
- urllib3: 2.0.7
- wandb: 0.16.1
- warc3-wet: 0.2.3
- warc3-wet-clueweb09: 0.2.5
- webencodings: 0.5.1
- werkzeug: 3.0.1
- wheel: 0.41.2
- yarl: 1.9.2
- zlib-state: 0.1.6
System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.13
- release: 5.15.0-88-generic
- version: #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023
Hi, I'm pretty new to this. Can you give an example of what an input file in JSONL format looks like?
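For reference, JSONL is simply one JSON object per line; following the id/text schema used in the README examples, a two-document file would look like this (contents hypothetical):

{"id": "doc_1", "text": "Generals gathered in their masses"}
{"id": "doc_2", "text": "Just like witches at black masses"}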
First, I really like this project!
The sparse and dense examples each work with minimal setup.
The issue is with the hybrid mode.
Here is the code:
from retriv import HybridRetriever
collection = [
{"id": "doc_1", "text": "Generals gathered in their masses"},
{"id": "doc_2", "text": "Just like witches at black masses"},
{"id": "doc_3", "text": "Evil minds that plot destruction"},
{"id": "doc_4", "text": "Sorcerer of death's construction"},
]
hr = HybridRetriever(
# Shared params ------------------------------------------------------------
index_name="hybrid-index",
# Sparse retriever params --------------------------------------------------
sr_model="bm25",
min_df=1,
tokenizer="whitespace",
stemmer="english",
stopwords="english",
do_lowercasing=True,
do_ampersand_normalization=True,
do_special_chars_normalization=True,
do_acronyms_normalization=True,
do_punctuation_removal=True,
# Dense retriever params ---------------------------------------------------
dr_model="sentence-transformers/multi-qa-MiniLM-L6-dot-v1",
normalize=True,
max_length=128,
use_ann=True,
)
he = hr.index(collection)
he.search(
query="witches", # What to search for
return_docs=True, # Default value, return the text of the documents
cutoff=5,  # Number of results to return (default: 100)
)
Error:
Building TDF matrix: 100%|██████████| 4/4 [00:01<00:00, 3.41it/s]
Building inverted index: 100%|██████████| 13/13 [00:00<00:00, 6786.90it/s]
Embedding documents: 100%|██████████| 4/4 [00:00<00:00, 206.63it/s]
Building ANN Searcher
100%|██████████| 1/1 [00:00<00:00, 20661.60it/s]
100%|██████████| 1/1 [00:00<00:00, 99.58it/s]
0%| | 0/1 [00:00<?, ?it/s]
╭─────────────────── Traceback (most recent call last) ───────────────────╮
│ /tmp/ipykernel_45461/1793453458.py:32 in <module>
│
│ [Errno 2] No such file or directory: '/tmp/ipykernel_45461/1793453458.py'
│
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/hybrid_retriever.py:255 in search
│
│   252 │   │   """
│   253 │
│   254 │   │   sparse_results = self.sparse_retriever.search(query, False, 1_000)
│ ❱ 255 │   │   dense_results = self.dense_retriever.search(query, False, 1_000)
│   256 │   │   hybrid_results = self.merger.fuse([sparse_results, dense_results])
│   257 │   │   return (
│   258 │   │   │   self.prepare_results(
│
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/dense_retriever/dense_retriever.py:251 in search
│
│   248 │   │   │   │   self.load_embeddings()
│   249 │   │   │   doc_ids, scores = compute_scores(encoded_query, self.embeddings, cutoff)
│   250 │
│ ❱ 251 │   │   doc_ids = self.map_internal_ids_to_original_ids(doc_ids)
│   252 │
│   253 │   │   return (
│   254 │   │   │   self.prepare_results(doc_ids, scores)
│
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/base_retriever.py:87 in map_internal_ids_to_original_ids
│
│    84 │   │   return results
│    85 │
│    86 │   def map_internal_ids_to_original_ids(self, doc_ids: Iterable) -> List[str]:
│ ❱  87 │   │   return [self.id_mapping[doc_id] for doc_id in doc_ids]
│    88 │
│    89 │   def save(self):
│    90 │   │   raise NotImplementedError()
│
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/base_retriever.py:87 in <listcomp>
│
│    84 │   │   return results
│    85 │
│    86 │   def map_internal_ids_to_original_ids(self, doc_ids: Iterable) -> List[str]:
│ ❱  87 │   │   return [self.id_mapping[doc_id] for doc_id in doc_ids]
│    88 │
│    89 │   def save(self):
│    90 │   │   raise NotImplementedError()
╰──────────────────────────────────────────────────────────────────────────╯
KeyError: -1
Trial 0 failed with parameters: {'b': 0.37, 'k1': 9.600000000000001} because of the following error:
AssertionError('Qrels and Run query ids do not match').
I guess this issue happens because qrels does not contain relevance scores for all the results present in the run.
Wouldn't it make sense to filter the run dictionary down to only the judged cases that occur in qrels?
sparse_results = self.sparse_retriever.search(query, False, 1_000)
dense_results = self.dense_retriever.search(query, False, 1_000)
hybrid_results = self.merger.fuse([sparse_results, dense_results])
The cutoff is not passed down.
A potential fix:
sparse_results = self.sparse_retriever.search(query, False, cutoff)
dense_results = self.dense_retriever.search(query, False, cutoff)
hybrid_results = self.merger.fuse([sparse_results, dense_results], cutoff)
CyHunSpell can't be installed on non-Debian-based Linux distributions.
@AmenRa :
Thanks for the good project.
I suggest supporting the Qdrant library for in-memory search as an alternative to FAISS.
I can help implement it.
Thanks!
Does the Advanced Retriever support semantic search? If not, are there future plans for it?
It would be great to include other retrieval approaches such as Pointwise Mutual Information.
Hi,
I have a dataset with around 2 million rows, and each text is no more than 20 tokens. I tried building an index using SparseRetriever:
from retriv import SparseRetriever
sr = SparseRetriever(
index_name="bm25",
model="bm25",
min_df=1,
tokenizer="whitespace",
stemmer="english",
stopwords="english",
do_lowercasing=True,
do_ampersand_normalization=True,
do_special_chars_normalization=True,
do_acronyms_normalization=True,
do_punctuation_removal=False,
)
collections = [{"id": id, "text": text} for id, text in zip(ids, descs)]
sr.index(collections)
My disk space is around 14 GB and my RAM is around 96 GB with 24 processors. Is there any option to chunk the data and index it one chunk at a time?
Hi,
Nice choice with War Pigs in the example. :)
I've been looking for a pure-Python search engine ever since Whoosh stopped being actively developed.
I realize this library is just getting started, but I was wondering if it would be possible to add the ability to search and filter by metadata as well.
For example
collection = [
{"id": "doc_1", "text": "Generals gathered in their masses", "album": "War pigs"},
{"id": "doc_2", "text": "Finished with my woman", "album": "Paranoid"}
]
I might want to search for all lines where the album has the word "pigs", for example.
Also, is the search OR by default, i.e., find ANY of the words in the query? Can we search with AND and other Boolean operators, as well as proximity and phrase search? Lucene has these features.
Any plan to combine knn search with the text search?
Hi,
Really great and useful library. Thanks for making it available for everyone.
I am mostly applying this for quick evaluation of search models and realized that DenseRetriever only uses the GPU for the documents when building the index, but not for the queries when running search, which makes it a bit slow for larger sets of queries.
Would you consider adding a use_gpu keyword argument to the search, msearch and bsearch methods of DenseRetriever and HybridRetriever? It looks like it could be handled similarly to the index method.
Just in case someone else is having the same issue, this problem can be avoided by directly setting the encoder device before running search as follows:
use_gpu = True
dr = dr.index(collection, use_gpu=use_gpu)
if use_gpu:
dr.encoder.change_device('cuda')
r = dr.bsearch(queries=queries)
dr.encoder.change_device('cpu')
Thanks!
How to specify where to save the index?
Hi! I see in speed.md that retriv's speed is really impressive. Did you also compare their retrieval effectiveness?
I am looking for an example of how to structure the queries and qrels parameters of the autotune function, because I searched the repo and didn't find one. Precisely, what should the keys and values of the queries dict be? And similarly for the qrels dict?
Thanks in advance for your help.
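A hedged sketch, assuming the ranx-style conventions retriv builds on; the exact key names (e.g. "id" vs "q_id") should be verified against the README before relying on this:

queries = [
    {"id": "q_1", "text": "witches masses"},
    {"id": "q_2", "text": "evil minds"},
]
qrels = {
    "q_1": {"doc_2": 1},  # query id -> {doc id: relevance}
    "q_2": {"doc_3": 1},
}
se.autotune(queries=queries, qrels=qrels)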
Hi, I can't help but notice that this codebase is largely missing docstrings.
This really hinders my experience as someone trying to use it; in particular, auto-completion does not work.
I know you've written your documentation in markdown files, and that's fine, but it doesn't explain every function I may want to use.
Is there an ETA on adding docstrings?
Is it possible to pass a pre-computed TF-IDF matrix to the Sparse Retriever?
Hi again,
I'm stuck on a strange behavior that, from my tests, seems to be related to the use of the SearchEngine.
I'm using a SingletonLogger that logs everything to stdout and persists the log to a file.
When the program runs, the index takes a bit of time to build, and if I check the logfile I can correctly see everything printed up to that point. After the SearchEngine finishes building the index, the first row of the logfile becomes a series of NUL bytes.
Below is a sample of the code and of the log file.
Can anyone give me pointers to solve this?
_logger = Logger()
[code doing stuff, collecting collection mainly]
_logger.info("Building index...")
SearchEngine("new-index").index(collection, show_progress=False)
_logger.info("Index built.")
The cutoff of msearch for HybridRetriever is hardcoded to 1_000, which makes map_internal_ids_to_original_ids raise a KeyError when the document count is less than 1_000.
retriv/retriv/hybrid_retriever.py
Lines 254 to 255 in c9baa01
Thus, map_internal_ids_to_original_ids should be:
def map_internal_ids_to_original_ids(self, doc_ids: Iterable) -> List[str]:
return [self.id_mapping[doc_id] for doc_id in doc_ids if doc_id != -1]
What would be the time complexity of sparse search using BM25?