beir-cellar / beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
Home Page: http://beir.ai
License: Apache License 2.0
In the custom_metrics.py file, in the top_accuracy function, you write:
top_hits[query_id] = sorted(doc_scores.keys(), key=lambda item: item[1], reverse=True)[0:k_max]
but I think you should instead write:
top_hits[query_id] = [elem[0] for elem in sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)[0:k_max]]
Indeed, in your code you are sorting the keys (document IDs) rather than the scores.
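For illustration, a minimal toy example (hypothetical document IDs and scores, not from the repo) showing why the two variants differ:

# Hypothetical doc_scores, chosen so the difference is visible.
doc_scores = {"d1": 0.2, "d9": 0.1, "d5": 0.9}
k_max = 2

# Current code: sorts the document IDs themselves, i.e. by the character at index 1, not by score.
by_key = sorted(doc_scores.keys(), key=lambda item: item[1], reverse=True)[0:k_max]

# Suggested fix: sorts (doc_id, score) pairs by score, then keeps only the IDs.
by_score = [elem[0] for elem in sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)[0:k_max]]

print(by_key)    # ['d9', 'd5'] -- ordered by the second character of the ID
print(by_score)  # ['d5', 'd1'] -- ordered by relevance score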
Hi, thanks for your awesome work! Does this framework support Chinese? How can I use it on my own Chinese dataset (sparse, dense, ...)? I mean, can I use my own tokenizer?
Thanks
Hey,
why are Quora & CQADupstack excluded from the leaderboard?
Hi, I saw from https://docs.google.com/spreadsheets/d/1L8aACyPaXrL8iEelJLGqlMqXKPX2oSP_R10pZoy77Ns/edit#gid=0 that Robust04 has been added to the BEIR leaderboard. Thanks for providing the results! Meanwhile, I was wondering what preprocessing is made for the dataset. For example, which fields of the topics are used for constructing the queries (title, desc, or narr)? Which parts of the documents are included (Headline, Text, Date, etc)? I'm asking because I tried to evaluate ANCE (the publicly released checkpoint) on Robust04 with minor preprocessing, but the ndcg@10 score is around 0.33, which is much lower than 0.39 as reported in the leaderboard. Thanks a lot!
Hi,
Thanks for the great work. BEIR is extremely valuable!
I just tried to run BEIR.ipynb on Google Colab and I was unable to complete the "Lexical Retrieval using BM25 (Elasticsearch)" section due to an unsupported error from Elasticsearch, as shown below:
I tried different versions but I couldn't get it to work. Any advice?
Hi,
Thank you for publicizing this great benchmark!
I am interested in the retrieval performance of models that are not trained on retrieval datasets (e.g., MS MARCO). I think this benchmark already supports USE, but it is not listed on the leaderboard. Did you already test the performance of the model on this benchmark? I think it would be very helpful if the leaderboard also provides results of general sentence embedding models (e.g., USE, SimCSE).
Hello,
I think there is an error in sparse_search.py; only the first query in each batch is evaluated.
Thank you!
Thanks for your great work. I notice that the average number of docs per query changed from 49.9 to 19.0 in the latest version of your paper. What do I need to do to keep the same setting as yours?
Thanks.
Hi,
I've got a dense IR pipeline with reranking running for a search engine application. However, my rerank scores are lower than those of a dense-IR-only run?
msmarco-distilbert-base-v3
ms-marco-electra-base cross encoder
Scores (Dense IR vs. Dense IR + Re-Rank):

Metric       Dense IR   Dense IR + Re-Rank
NDCG@1       0.3629     0.3538
NDCG@3       0.5234     0.5170
NDCG@5       0.5472     0.5401
NDCG@10      0.5623     0.5540
NDCG@100     0.5879     0.5812
NDCG@1000    0.5965     0.5812
MAP@1        0.3629     0.3538
MAP@3        0.4844     0.4774
MAP@5        0.4977     0.4903
MAP@10       0.5040     0.4961
MAP@100      0.5090     0.5013
MAP@1000     0.5093     0.5013
Recall@1     0.3629     0.3538
Recall@3     0.6362     0.6315
Recall@5     0.6932     0.6869
Recall@10    0.7397     0.7297
Recall@100   0.8627     0.8618
Recall@1000  0.9310     0.8618
P@1          0.3629     0.3538
P@3          0.2121     0.2105
P@5          0.1386     0.1374
P@10         0.0740     0.0730
P@100        0.0086     0.0086
P@1000       0.0009     0.0009
Any thoughts would be greatly appreciated.
Hello,
Thanks for your great work!
I am just curious how we can reproduce some results and get the bioasq, signal1m, etc. datasets to work with beir.
(We are currently testing our own model on beir!)
First, a thank you. The paper and repo have been fantastic resources to help conversations around out-of-domain retrieval!
Second, a feature request. I think it would be very interesting to see some of the document/index enrichment approaches added to the benchmark and paper discussion, as extensions to sparse lexical retrieval. You mention both doc2query and DeepCT/HDCT in the paper but don't provide benchmark data for them. Since they are trained on MS MARCO, it would be interesting to see if they perform well out-of-domain and in comparison to both BM25+CE and ColBERT, which perform very well out-of-domain.
Hi, I'm trying to experiment with beir, BM25 and pyserini, but I'm unable to find the docker image (beir/beir-pyserini). Looking at dockerhub (https://hub.docker.com/u/beir) it seems that the only image available is pyserini-fastapi.
Is pyserini-fastapi the same as beir-pyserini, or is there something that I'm doing wrong?
Hi,
Syntax issue in evaluate_custom_model.py due to a missing typing import for List and Dict:
from typing import List, Dict
Thanks for all the great work!
Hi, I have questions regarding the TREC-COVID dataset.
(1) I should retrieve documents from the entire corpus (171K), right?
(2) The qrels test.tsv has three labels (0, 1, 2). When I get predictions from the BM25 baseline, how should I assign values to them? All 1?
Thank you
Hi~ I am trying to use the BM25 model to evaluate a custom dataset. When I use the following code:
#### Sentence-Transformer ####
#### Provide any pretrained sentence-transformers model path
#### Complete list - https://www.sbert.net/docs/pretrained_models.html
# model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"))
model = BM25(index_name="your-index-name", hostname="127.0.0.1:9200", initialize=True )
# retriever = EvaluateRetrieval(model, score_function="cos_sim")
retriever = EvaluateRetrieval(model)
#### Retrieve dense results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries)
#### Evaluate your retrieval using NDCG@k, MAP@K ...
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
but it doesn't work
2021-08-04 17:57:32 - Activating Elasticsearch....
2021-08-04 17:57:32 - Elastic Search Credentials: {'hostname': '127.0.0.1:9200', 'index_name': 'your-index-name', 'keys': {'title': 'title', 'body': 'txt'}, 'timeout': 100, 'retry_on_timeout': True, 'maxsize': 24, 'number_of_shards': 'default', 'language': 'english'}
english
2021-08-04 17:57:32 - Deleting previous Elasticsearch-Index named - your-index-name
2021-08-04 17:57:32 - Creating fresh Elasticsearch-Index named - your-index-name
0%| | 0/2 [00:00<?, ?docs/s]
que: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 26.55it/s]
Traceback (most recent call last):
File "/.../evaluate_custom_dataset.py", line 67, in <module>
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
File "/.../beir/beir/retrieval/evaluation.py", line 74, in evaluate
ndcg[f"NDCG@{k}"] = round(ndcg[f"NDCG@{k}"]/len(scores), 5)
ZeroDivisionError: float division by zero
Is there any other code I need to modify? Thank you~
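For context, the ZeroDivisionError means len(scores) is zero, i.e. pytrec_eval returned no per-query results. One possible cause - an assumption, not something visible in the logs above - is that the query IDs in qrels do not overlap with the query IDs in results. A small hypothetical sanity check to run after retriever.retrieve:

# Hypothetical check: evaluation averages over the queries present in both qrels and results,
# so an empty intersection leads straight to a division by zero.
shared_ids = set(qrels.keys()) & set(results.keys())
print(len(shared_ids), "query IDs appear in both qrels and results")
assert shared_ids, "No overlapping query IDs between qrels and results"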
Just saw the submission to NeurIPS (https://openreview.net/forum?id=wCu6T5xFjeJ), great work!
I was just wondering whether you plan to release the additional annotations for Trec-covid’s hole filling? Thanks!
Thanks for providing this resource!
I noticed that the hash of webis-touche2020.zip recently changed, so I dug a bit into the differences. It looks like the only changes were to webis-touche2020/qrels/test.tsv:
$ wc -l {old,new}/webis-touche2020/qrels/test.tsv
2963 old/webis-touche2020/qrels/test.tsv
2299 new/webis-touche2020/qrels/test.tsv
$ head {old,new}/webis-touche2020/qrels/test.tsv -n5
==> old/webis-touche2020/qrels/test.tsv <==
query-id corpus-id score
1 197beaca-2019-04-18T11:28:59Z-00001-000 4
1 1a76ed9f-2019-04-18T16:07:27Z-00001-000 5
1 1a76ed9f-2019-04-18T16:07:27Z-00002-000 3
1 1a76ed9f-2019-04-18T16:07:27Z-00005-000 4
==> new/webis-touche2020/qrels/test.tsv <==
query-id corpus-id score
1 S197beaca-A971412e6 0
1 S1b03f390-A22aff8a0 0
1 S1b03f390-Aa73ba80f 1
1 S1b03f390-Ab387b162 0
I'm not super familiar with the task, but based on this page and the linked qrels, it looks like the old version corresponds to version 1 of the args.me corpus, and the new version corresponds to version 2. However, the corpus.jsonl remains unchanged, so the corpus-id field in the qrels no longer corresponds to the document _id fields in the corpus.
Can you please clarify these discrepancies?
I'm running evaluate_anserini_docT5query.py and it's currently only able to utilize a single GPU. I'm using GCP instances with 4x V100 GPUs and I'd like to finish more experiments in less time by using more GPUs. Is there a simple way to parallelize batches or configure using multiple GPUs that I'm not aware of?
I think this benchmark could support choosing the best model from a list of models by comparing their performance measurements on a given dataset. This requires the datasets to have the same interface.
It could also support model combination, i.e., switching which model is used (sometimes BM25, sometimes SBERT) depending on the semantic characteristics of the query, to make the final conclusion more consistent.
This would make BEIR not only a benchmark, but also a meta/ensemble framework that combines models to improve the final performance on a single dataset with different features.
Thanks for the great contribution!
I found that the downloaded data of NQ only contains the test files and the corpus; where can I get the training files?
Thank you!
Hi, I was trying to run your evaluate_bm25.py baseline, but I got the following error. There may be some problem with elasticsearch. Could you please help me fix it?
2022-02-17 02:38:34 - Loading Queries...
2022-02-17 02:38:34 - Loaded 300 TEST Queries.
2022-02-17 02:38:34 - Query Example: 0-dimensional biomaterials show inductive properties.
2022-02-17 02:38:34 - Activating Elasticsearch....
2022-02-17 02:38:34 - Elastic Search Credentials: {'hostname': 'localhost', 'index_name': 'scifact', 'keys': {'title': 'title', 'body': 'txt'}, 'timeout': 100, 'retry_on_timeout': True, 'maxsize': 24, 'number_of_shards': 1, 'language': 'english'}
Traceback (most recent call last):
File "evaluate_bm25.py", line 64, in <module>
model = BM25(index_name=index_name, hostname=hostname, initialize=initialize, number_of_shards=number_of_shards)
File "/anaconda/envs/beir/lib/python3.8/site-packages/beir/retrieval/search/lexical/bm25_search.py", line 22, in __init__
self.es = ElasticSearch(self.config)
File "/anaconda/envs/beir/lib/python3.8/site-packages/beir/retrieval/search/lexical/elastic_search.py", line 34, in __init__
self.es = Elasticsearch(
File "/anaconda/envs/beir/lib/python3.8/site-packages/elasticsearch/_sync/client/__init__.py", line 312, in __init__
node_configs = client_node_configs(
File "/anaconda/envs/beir/lib/python3.8/site-packages/elasticsearch/_sync/client/utils.py", line 101, in client_node_configs
node_configs = hosts_to_node_configs(hosts)
File "/anaconda/envs/beir/lib/python3.8/site-packages/elasticsearch/_sync/client/utils.py", line 141, in hosts_to_node_configs
node_configs.append(url_to_node_config(host))
File "/anaconda/envs/beir/lib/python3.8/site-packages/elastic_transport/client_utils.py", line 198, in url_to_node_config
raise ValueError(
ValueError: URL must include a 'scheme', 'host', and 'port' component (ie 'https://localhost:9200')
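For what it's worth, the ValueError itself points at the likely cause: newer elasticsearch Python clients require a full URL with scheme, host, and port. A hedged workaround (an assumption about your setup, not a confirmed fix) would be to pass the full URL as the hostname in evaluate_bm25.py:

# Assumption: the hostname string is forwarded to the Elasticsearch client,
# so a full URL (scheme + host + port) satisfies the newer client's requirement.
hostname = "http://localhost:9200"
model = BM25(index_name=index_name, hostname=hostname, initialize=initialize, number_of_shards=number_of_shards)

Alternatively, pinning an older elasticsearch client version may also sidestep the check.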
The two leaderboard tabs are laid out slightly differently. It would be nice to normalize these or put them on a single tab. If anything, just restricting the "average" score to zero-shot methods in the "re-ranking leaderboard" would help compare across tabs. Basically, the same view as in the paper would be preferred.
Hi @NThakur20, I was wondering if we can train a T5 model, as there seems to be an error when I load a T5 model from HF.
Is there a way to cache/load embedded documents and queries? That would help save time on embedding big datasets such as MS MARCO and NQ.
Does the provided elastic search installation script in the colab work for you?
I'm getting the above error, would appreciate any help :)
Hi, Nandan,
When we create the qrels.tsv file for a custom preprocessed dataset, can it include just the relevant documents (i.e., score=1), or should non-relevant documents also be included?
Cheers
Xiang
Hi,
Very useful work, thanks!
I was wondering why you did not include standard ad-hoc retrieval collections (like Robust04) in the benchmark? Is it intentional?
For people working on neural IR, it would be interesting to see how models trained on MS MARCO systematically generalize to these collections too
Hi,
Thanks for the wonderful package!
I want to train a ColBERT model on a new test collection which I have. I couldn't find any example related to it.
Could you please point out how to train a ColBERT model with the package?
Thanks
Hi, it seems that some datasets that are on the leaderboard are missing from https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/
For example:
bioasq
signal1m
trec-news
Is there a reason for removing those datasets?
Hi! Thanks for sharing BEIR - this is a wonderful project!
I wonder if you can include DPR model trained on MSMARCO on the leaderboard? The current DPR (Multi) and DPR (KILT) are not really comparable with other models which are trained on MARCO.
Thanks!
Can I find the complete script used to generate this leaderboard somewhere? I saw snippets such as benchmark_bm25.py but not a full-scale script that includes the Elasticsearch config and all.
I am implementing a BEIR compatible Vespa version that I plan to submit as a PR soon. I am, however, finding different results between my BM25 metrics and the elastic BM25 results from the leaderboard.
Generating results side by side would be great to debug my implementation.
Hi there,
I've been working on a dense IR pipeline with BEIR including a custom dataloader, which works fine for dense IR runs but throws an exception whenever I add a cross encoder for reranking.
Rerank:
cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-electra-base')
reranker = Rerank(cross_encoder_model, batch_size=128)
Dataloader:
corpus = {}
for index, item in corpusdf.iteritems():
corpus.update({
"doc"+(str(index)): {
"title": "",
"text": item,
},
})
queries = {}
for index, row in queriesdf.iterrows():
queries.update({
"q"+str(index): {
"doc"+(str(index)): row[0],
},
})
qrels = {}
for i in range(len(df)):
qrels.update({
"q"+str(i): {
"doc"+(str(i)): 1,
},
})
Exception:
Traceback (most recent call last):
File "C:\Users\costco\venv\lib\site-packages\sentence_transformers\cross_encoder\CrossEncoder.py", line 273, in predict
for features in iterator:
File "C:\Users\costco\venv\lib\site-packages\tqdm\std.py", line 1180, in __iter__
for obj in iterable:
File "C:\Users\costco\venv\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
data = self._next_data()
File "C:\Users\costco\venv\lib\site-packages\torch\utils\data\dataloader.py", line 561, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "C:\Users\costco\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 52, in fetch
return self.collate_fn(data)
File "C:\Users\costco\venv\lib\site-packages\sentence_transformers\cross_encoder\CrossEncoder.py", line 93, in smart_batching_collate_text_only
texts[idx].append(text.strip())
AttributeError: 'dict' object has no attribute 'strip'
Seems like a simple fix but I am trying to avoid modifying BEIR sources, any ideas would be greatly appreciated!
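For reference, a minimal sketch of input shapes that would avoid the strip() failure in the traceback: the cross-encoder ends up calling .strip() on the query text, so query values presumably need to be plain strings rather than nested dicts. The IDs and texts below are made up for illustration:

# Hypothetical shapes (IDs/texts invented); note queries maps query-id -> string.
corpus = {
    "doc0": {"title": "", "text": "First document body."},
    "doc1": {"title": "", "text": "Second document body."},
}
queries = {
    "q0": "first query text",   # plain string, not {"doc0": "first query text"}
    "q1": "second query text",
}
qrels = {
    "q0": {"doc0": 1},
    "q1": {"doc1": 1},
}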
The code below gives me 250 queries for robust04; however, you report having only 249 in the paper. How come?
import ir_datasets
dataset = ir_datasets.load("trec-robust04")
x=list(dataset.queries_iter())
len(x)
I couldn't understand what this line is trying to do... corpus_id and query_id are from completely different groups and it's fine that they are the same, right? Removing this if statement has a huge impact on ndcg score (tested with ANCE@arguana).
I am getting an error when running the bm25 evaluation file.
Unable to create Index in Elastic Search. Reason: The client noticed that the server is not a supported distribution of Elasticsearch
The error was not there last week when I ran the file. I am using Colab to run the file. Any idea what may be the issue here?
I'm using evaluate_anserini_docT5query.py. That script uses util.download_and_unzip and then GenericDataLoader, and it fails with the following message when using the cqadupstack dataset:
ValueError: File /home/josh/source/beir/examples/retrieval/evaluation/sparse/datasets/cqadupstack/corpus.jsonl not present! Please provide accurate file.
The CQADupstack dataset is divided up into sub-categories which is causing the above error since it contains an extra sub-directory per-category.
ls -las ~/source/beir/examples/retrieval/evaluation/sparse/datasets/cqadupstack/
total 56
4 drwxr-xr-x 14 josh josh 4096 Jul 1 08:38 .
4 drwxr-xr-x 3 josh josh 4096 Jul 1 08:35 ..
4 drwxr-xr-x 3 josh josh 4096 Jul 1 08:38 android
4 drwxr-xr-x 3 josh josh 4096 Jul 1 08:37 english
4 drwxr-xr-x 3 josh josh 4096 Jul 1 08:35 gaming
4 drwxr-xr-x 3 josh josh 4096 Jul 1 08:37 gis
4 drwxr-xr-x 3 josh josh 4096 Jul 1 08:37 mathematica
4 drwxr-xr-x 3 josh josh 4096 Jul 1 08:37 physics
4 drwxr-xr-x 3 josh josh 4096 Jul 1 08:37 programmers
4 drwxr-xr-x 3 josh josh 4096 Jul 1 08:37 stats
4 drwxr-xr-x 3 josh josh 4096 Jul 1 08:36 tex
4 drwxr-xr-x 3 josh josh 4096 Jul 1 08:37 unix
4 drwxr-xr-x 3 josh josh 4096 Jul 1 08:37 webmasters
4 drwxr-xr-x 3 josh josh 4096 Jul 1 08:37 wordpress
How was CQADupstack used in the benchmarking for the paper and leaderboard? Was each category processed separately or was everything somehow combined into a single evaluation?
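For illustration, a hedged sketch (not necessarily how the paper did it) of evaluating each CQADupstack sub-forum separately and then averaging, using BM25 as an example retriever:

import os
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search as BM25

data_path = "datasets/cqadupstack"  # hypothetical path to the unzipped dataset
ndcg10_per_forum = {}

for forum in sorted(os.listdir(data_path)):
    # Each sub-directory (android, english, gaming, ...) is a self-contained BEIR dataset.
    corpus, queries, qrels = GenericDataLoader(data_folder=os.path.join(data_path, forum)).load(split="test")
    retriever = EvaluateRetrieval(BM25(index_name="cqadupstack-" + forum, hostname="localhost", initialize=True))
    results = retriever.retrieve(corpus, queries)
    ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
    ndcg10_per_forum[forum] = ndcg["NDCG@10"]

# One possible aggregation: the unweighted mean over the 12 sub-forums (mean of means).
print(sum(ndcg10_per_forum.values()) / len(ndcg10_per_forum))

Whether the paper reports this mean of means or a query-weighted mean is exactly the kind of detail worth confirming with the authors.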
First of all, thanks for this amazing benchmark. I'd like to evaluate a re-ranking model on several datasets. If I got it correctly, I will have to download and index all datasets independently to get the top-100 BM25 rank lists. Could you please provide those for each dataset for easier evaluation?
Thanks a lot!
We see this approach in the "dense leaderboard", but it would be good to add a row to the "re-ranking leaderboard" for easier comparisons against baselines.
I tried to replicate BM25 evaluation on trec-covid, but ran into the following problem:
2021-04-21 16:16:57 - Downloading trec-covid.zip ...
2021-04-21 16:16:57 - Unzipping trec-covid.zip ...
2021-04-21 16:16:58 - Loaded 171332 TEST Documents.
2021-04-21 16:16:58 - Doc Example: {'text': 'OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35%) with upper respiratory tract infections, and 2 (5%) with bronchiolitis. Cough (82.5%), fever (75%), and malaise (58.8%) were the most common symptoms, and crepitations (60%), and wheezes (40%) were the most common signs. Most patients with pneumonia had crepitations (79.2%) but only 25% had bronchial breathing. Immunocompromised patients were more likely than non-immunocompromised patients to present with pneumonia (8/9 versus 16/31, P = 0.05). Of the 24 patients with pneumonia, 14 (58.3%) had uneventful recovery, 4 (16.7%) recovered following some complications, 3 (12.5%) died because of M pneumoniae infection, and 3 (12.5%) died due to underlying comorbidities. The 3 patients who died of M pneumoniae pneumonia had other comorbidities. CONCLUSION: our results were similar to published data except for the finding that infections were more common in infants and preschool children and that the mortality rate of pneumonia in patients with comorbidities was high.', 'title': 'Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia'}
2021-04-21 16:16:58 - Loaded 50 TEST Queries.
2021-04-21 16:16:58 - Query Example: what is the origin of COVID-19
2021-04-21 16:16:58 - Activating Elasticsearch....
2021-04-21 16:16:58 - Elastic Search Credentials: {'hostname': 'localhost', 'index_name': 'trec-covid', 'keys': {'title': 'title', 'body': 'txt'}, 'timeout': 100, 'retry_on_timeout': True, 'maxsize': 24}
2021-04-21 16:16:58 - Deleting previous Elasticsearch-Index named - trec-covid
2021-04-21 16:16:58 - Unable to create Index in Elastic Search. Reason: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7fa550364110>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7fa550364110>: Failed to establish a new connection: [Errno 111] Connection refused)
2021-04-21 16:16:58 - Creating fresh Elasticsearch-Index named - trec-covid
2021-04-21 16:16:58 - Unable to create Index in Elastic Search. Reason: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7fa550364c10>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7fa550364c10>: Failed to establish a new connection: [Errno 111] Connection refused)
Traceback (most recent call last):
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
raise err
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 252, in perform_request
method, url, body, retries=Retry(False), headers=request_headers, **kw
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/util/retry.py", line 507, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connectionpool.py", line 394, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connection.py", line 234, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/home/USER/.conda/envs/beir/lib/python3.7/http/client.py", line 1277, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/home/USER/.conda/envs/beir/lib/python3.7/http/client.py", line 1323, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/home/USER/.conda/envs/beir/lib/python3.7/http/client.py", line 1272, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/home/USER/.conda/envs/beir/lib/python3.7/http/client.py", line 1032, in _send_output
self.send(msg)
File "/home/USER/.conda/envs/beir/lib/python3.7/http/client.py", line 972, in send
self.connect()
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connection.py", line 200, in connect
conn = self._new_conn()
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fa54912e690>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "evaluate_bm25.py", line 33, in <module>
results = retriever.retrieve(corpus, queries)
File "/home/USER/projects/beir-repo/beir/retrieval/evaluation.py", line 22, in retrieve
return self.retriever.search(corpus, queries, self.top_k, self.score_function)
File "/home/USER/projects/beir-repo/beir/retrieval/search/lexical/bm25_search.py", line 33, in search
self.index(corpus)
File "/home/USER/projects/beir-repo/beir/retrieval/search/lexical/bm25_search.py", line 66, in index
progress=progress
File "/home/USER/projects/beir-repo/beir/retrieval/search/lexical/elastic_search.py", line 89, in bulk_add_to_index
client=self.es, index=self.index_name, actions=generate_actions,
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 326, in streaming_bulk
**kwargs
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 246, in _process_bulk_chunk
for item in gen:
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 193, in _process_bulk_chunk_error
raise error
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 234, in _process_bulk_chunk
resp = client.bulk("\n".join(bulk_actions) + "\n", *args, **kwargs)
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/client/utils.py", line 153, in _wrapped
return func(*args, params=params, headers=headers, **kwargs)
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/client/__init__.py", line 460, in bulk
body=body,
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/transport.py", line 413, in perform_request
raise e
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/transport.py", line 388, in perform_request
timeout=timeout,
File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 264, in perform_request
raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7fa54912e690>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7fa54912e690>: Failed to establish a new connection: [Errno 111] Connection refused)
The only things I've changed in evaluate_bm25.py are:
dataset = "trec-covid"  (line 17)
hostname = "localhost"  #localhost  (line 26)
index_name = dataset  # scifact  (line 27)
Elasticsearch is new to me, so I'm not sure whether I've missed anything. I used pip install -e . to install BEIR from source, if that's relevant. Thanks!
Hello,
Thanks for releasing the benchmark datasets. This is great work!
After downloading the ArguAna dataset you provided, we found that the text fields of queries.jsonl and corpus.jsonl are the same.
Is that on purpose or a bug?
Looking forward to hearing from you, thanks!
Congrats on this very well structured, documented and helpful framework for figuring out what's going on in IR - especially on OOD data. Keep up the good work!
When loading DPR models from HF modelhub like:
model = DRES(models.SentenceBERT((
"facebook/dpr-question_encoder-multiset-base",
"facebook/dpr-ctx_encoder-multiset-base",
" [SEP] "), batch_size=128))
I run into a NotImplementedError: Make sure _init_weigths is implemented for <class 'transformers.models.dpr.modeling_dpr.DPRQuestionEncoder'>
I know you already converted the model to sentence-transformers format and that it can be loaded like this, but interoperability with the HF hub would be slick - also for other DPR models in other languages like French or German.
Thanks
How would you assign numbers in the QRel file for BioASQ?
Could you please provide an example? Let's say for a question Q with id ID, the ideal documents are [D1, D2, D3]. What would the QRel entries look like?
Thank you
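For what it's worth, a hedged sketch of what such entries could look like in BEIR's qrels TSV format (the same query-id / corpus-id / score layout shown in the webis-touche2020 qrels above; assigning a score of 1 to every ideal document is an assumption, not a confirmed convention for BioASQ):

query-id	corpus-id	score
ID	D1	1
ID	D2	1
ID	D3	1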
Regarding BM25Search and others: I see that the load_train function of TrainRetriever uses the score to filter samples. I think some experimental conclusions about training the retriever with different filters (QFilter) should be taken into consideration.
First, I would like to thank you for this incredible framework; it has been of great help to me.
I have a question concerning the CQADupstack results. As there are 12 different corpora for that dataset, which mean did you use to compute your results? The mean of means (i.e., each dataset has equal weight) or the mean over all the queries (i.e., each query has equal weight)?
Thanks in advance.
Hi @NThakur20
There seems to be an issue when I try to run evaluate_sbert.py in Colab. It was working fine until yesterday and I have not made any changes: I just pip installed beir, git cloned the beir repo, and ran the Python file without any change to the file. The error is something like this:
2021-11-10 15:30:08.003342: E tensorflow/core/lib/monitoring/collection_registry.cc:77] Cannot register 2 metrics with the same name: /tensorflow/api/keras/optimizers
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 2150, in _get_module
return importlib.import_module("." + module_name, self.__name__)
File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_utils.py", line 637, in <module>
class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin, PushToHubMixin):
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/lazy_loader.py", line 62, in __getattr__
module = self._load()
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/lazy_loader.py", line 45, in _load
module = importlib.import_module(self.__name__)
File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/usr/local/lib/python3.7/dist-packages/keras/__init__.py", line 25, in <module>
from keras import models
File "/usr/local/lib/python3.7/dist-packages/keras/models.py", line 20, in <module>
from keras import metrics as metrics_module
File "/usr/local/lib/python3.7/dist-packages/keras/metrics.py", line 26, in <module>
from keras import activations
File "/usr/local/lib/python3.7/dist-packages/keras/activations.py", line 20, in <module>
from keras.layers import advanced_activations
File "/usr/local/lib/python3.7/dist-packages/keras/layers/__init__.py", line 23, in <module>
from keras.engine.input_layer import Input
File "/usr/local/lib/python3.7/dist-packages/keras/engine/input_layer.py", line 21, in <module>
from keras.engine import base_layer
File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py", line 43, in <module>
from keras.mixed_precision import loss_scale_optimizer
File "/usr/local/lib/python3.7/dist-packages/keras/mixed_precision/loss_scale_optimizer.py", line 18, in <module>
from keras import optimizers
File "/usr/local/lib/python3.7/dist-packages/keras/optimizers.py", line 26, in <module>
from keras.optimizer_v2 import adadelta as adadelta_v2
File "/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/adadelta.py", line 22, in <module>
from keras.optimizer_v2 import optimizer_v2
File "/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py", line 37, in <module>
"/tensorflow/api/keras/optimizers", "keras optimizer usage", "method")
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/monitoring.py", line 361, in __init__
len(labels), name, description, *labels)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/monitoring.py", line 135, in __init__
self._metric = self._metric_methods[self._label_length].create(*args)
tensorflow.python.framework.errors_impl.AlreadyExistsError: Another metric with the same name already exists.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 2150, in _get_module
return importlib.import_module("." + module_name, self.__name__)
File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/usr/local/lib/python3.7/dist-packages/transformers/models/__init__.py", line 19, in <module>
from . import (
File "/usr/local/lib/python3.7/dist-packages/transformers/models/layoutlm/__init__.py", line 22, in <module>
from .configuration_layoutlm import LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP, LayoutLMConfig
File "/usr/local/lib/python3.7/dist-packages/transformers/models/layoutlm/configuration_layoutlm.py", line 22, in <module>
from ...onnx import OnnxConfig, PatchingSpec
File "/usr/local/lib/python3.7/dist-packages/transformers/onnx/__init__.py", line 17, in <module>
from .convert import export, validate_model_outputs
File "/usr/local/lib/python3.7/dist-packages/transformers/onnx/convert.py", line 23, in <module>
from .. import PreTrainedModel, PreTrainedTokenizer, TensorType, TFPreTrainedModel, is_torch_available
File "<frozen importlib._bootstrap>", line 1032, in _handle_fromlist
File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 2140, in __getattr__
module = self._get_module(self._class_to_module[name])
File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 2154, in _get_module
) from e
RuntimeError: Failed to import transformers.modeling_tf_utils because of the following error (look up to see its traceback):
Another metric with the same name already exists.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/content/beir/examples/retrieval/evaluation/dense/evaluate_sbert.py", line 2, in <module>
from beir.retrieval import models
File "/usr/local/lib/python3.7/dist-packages/beir/retrieval/models/__init__.py", line 1, in <module>
from .sentence_bert import SentenceBERT
File "/usr/local/lib/python3.7/dist-packages/beir/retrieval/models/sentence_bert.py", line 1, in <module>
from sentence_transformers import SentenceTransformer
File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/__init__.py", line 3, in <module>
from .datasets import SentencesDataset, ParallelSentencesDataset
File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/datasets/__init__.py", line 3, in <module>
from .ParallelSentencesDataset import ParallelSentencesDataset
File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/datasets/ParallelSentencesDataset.py", line 4, in <module>
from .. import SentenceTransformer
File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/SentenceTransformer.py", line 27, in <module>
from .models import Transformer, Pooling, Dense
File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/models/__init__.py", line 1, in <module>
from .Transformer import Transformer
File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/models/Transformer.py", line 2, in <module>
from transformers import AutoModel, AutoTokenizer, AutoConfig
File "<frozen importlib._bootstrap>", line 1032, in _handle_fromlist
File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 2140, in __getattr__
module = self._get_module(self._class_to_module[name])
File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 2154, in _get_module
) from e
RuntimeError: Failed to import transformers.models.auto because of the following error (look up to see its traceback):
Failed to import transformers.modeling_tf_utils because of the following error (look up to see its traceback):
Another metric with the same name already exists.
Thanks for the great work.
When I run the Quick Example, I got a TypeError:
Traceback (most recent call last):
File "demo.py", line 34, in <module>
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
File "/home/renruiyang/anaconda3/envs/beir/lib/python3.6/site-packages/beir/retrieval/evaluation.py", line 63, in evaluate
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {map_string, ndcg_string, recall_string, precision_string})
TypeError: Unable to resolve all measures.
Could you help me solve this error?
Thanks a lot!
Hi again @NThakur20! I've got an interesting search project which consists of a golden set of search queries and their results, for a financial services domain search application. One of the search types is for financial analysts, e.g. a partial analyst-name query which the search engine answers with the full contact particulars for that analyst. Using a fine-tuned T5.1 large model I am achieving 97%+ classification accuracy for observed searches, but the issue is generalization to new searches where the analyst exists only in the database and the model needs to generate the response based on unsupervised contact data in the analyst contact database. So the thought was to either train an MS MARCO T5 model in an unsupervised fashion on the contact database in the hope that it generalizes to unobserved search queries, or to populate a deep IR pipeline with those records and use that for analyst contact retrieval. Is this a reasonable use case for BEIR?
Hi there, thanks for providing this nice resource!
Looking at your paper, I think your BM25 baselines are a bit low? You report 0.218 nDCG@10 on MS MARCO, if I'm not mistaken - from Table 2.
With Pyserini https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md - we can get, and this has been widely reproduced:
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap -m ndcg_cut.10 collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.bm25tuned.trec
map all 0.1957
recall_1000 all 0.8573
ndcg_cut_10 all 0.2340
So, 1.6 points higher?
I suspect all the BM25 results should be a bit higher, based on our experience: https://arxiv.org/abs/2104.05740
For many of the other datasets with dense labels, a competitive baseline - and widely acknowledged in the IR community - would be something like BM25+RM3.
We would be happy to work with you on building out Pyserini as the competitive baseline for this task... Please reach out!
Hello,
I'm working on the GenQ setting and found that the model loading part seems incorrect. In query_gen_and_train.py it looks like this:
#### Provide any sentence-transformers model path
model_path = "bert-base-uncased" # or "msmarco-distilbert-base-v3"
retriever = TrainRetriever(model_path=model_path, batch_size=64, max_seq_length=350)
However, the TrainRetriever class doesn't have a model_path argument. It seems that the error was introduced in this commit.
Shi Yu
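For comparison, a hedged sketch of how the loading might look if TrainRetriever expects an already constructed model rather than a path (an assumption about the current API, not a confirmed fix):

from sentence_transformers import SentenceTransformer
from beir.retrieval.train import TrainRetriever

# Assumption: TrainRetriever takes a SentenceTransformer instance via `model`
# instead of a raw `model_path` string.
model_path = "bert-base-uncased"  # or "msmarco-distilbert-base-v3"
model = SentenceTransformer(model_path)
model.max_seq_length = 350
retriever = TrainRetriever(model=model, batch_size=64)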