
splade's Issues

Installation error - splade with tokenizers v0.12.1 – compatibility issue with Python 3.11.1 and Rust (v. 1.72, 1.76, 1.69, 1.62)

Splade has tokenizers v0.12.1 as a dependency, which has a known build conflict with multiple versions of Rust. Can we please update the dependency to a version >0.14.1?

warning: variable does not need to be mutable
         --> tokenizers-lib\src\models\unigram\model.rs:265:21
          |
      265 |                 let mut target_node = &mut best_path_ends_at[key_pos];
          |                     ----^^^^^^^^^^^
          |                     |
          |                     help: remove this `mut`
          |
          = note: `#[warn(unused_mut)]` on by default
     
      warning: variable does not need to be mutable
         --> tokenizers-lib\src\models\unigram\model.rs:282:21
          |
      282 |                 let mut target_node = &mut best_path_ends_at[starts_at + mblen];
          |                     ----^^^^^^^^^^^
          |                     |
          |                     help: remove this `mut`
     
      warning: variable does not need to be mutable
         --> tokenizers-lib\src\pre_tokenizers\byte_level.rs:200:59
          |
      200 |     encoding.process_tokens_with_offsets_mut(|(i, (token, mut offsets))| {
          |                                                           ----^^^^^^^
          |                                                           |
          |                                                           help: remove this `mut`
     
      error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell`
         --> tokenizers-lib\src\models\bpe\trainer.rs:526:47
          |
      522 |                     let w = &words[*i] as *const _ as *mut _;
          |                             -------------------------------- casting happend here
      ...
      526 |                         let word: &mut Word = &mut (*w);
          |                                               ^^^^^^^^^
          |
          = note: for more information, visit <https://doc.rust-lang.org/book/ch15-05-interior-mutability.html>
          = note: `#[deny(invalid_reference_casting)]` on by default
  warning: `tokenizers` (lib) generated 3 warnings
  error: could not compile `tokenizers` (lib) due to 1 previous error; 3 warnings emitted

Alternatively, if anyone knows how to install Splade without this issue, please advise.

Zero-dimension query embedding

In the notebook I made some modifications and I get back a zero-dimensional embedding. Specifically, I wanted to see the BOW representation of a quoted search query using the efficient-splade models. Is it expected for the model to sometimes return zero-dimensional embeddings? Without the quotes it generates the expected representation.

model_type_or_dir = "naver/efficient-splade-V-large-query"
q_model_type_or_dir = "naver/efficient-splade-V-large-doc"

# loading model and tokenizer

model = Splade(model_type_or_dir, q_model_type_or_dir, agg="max")
model.eval()
tokenizer = AutoTokenizer.from_pretrained(q_model_type_or_dir)
reverse_voc = {v: k for k, v in tokenizer.vocab.items()}

# example query (quoted)

query = '"a big fat potato"'

# now compute the query representation
with torch.no_grad():
    inputs = tokenizer(query, return_tensors="pt")
    print(inputs)
    query_rep = model(q_kwargs=inputs)["q_rep"].squeeze()  # (sparse) query rep in voc space, shape (30522,)

# get the number of non-zero dimensions in the rep:
col = torch.nonzero(query_rep).squeeze().cpu().tolist()
print("number of actual dimensions: ", len(col))

# now let's inspect the bow representation:
weights = query_rep[col].cpu().tolist()
d = {k: v for k, v in zip(col, weights)}
sorted_d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
bow_rep = []
for k, v in sorted_d.items():
    bow_rep.append((reverse_voc[k], round(v, 2)))
print("SPLADE BOW rep:\n", bow_rep)

Great job!

It's great to see sparse IR models, especially SPLADE and SPLADE++, achieve such good performance.

YAML installation doesn't work on macOS with miniconda

OS:

macOS Ventura 13.3.1 
conda 23.1.0

Command:
conda env create -f conda_splade_env.yml

Output:

Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:
  - libgfortran4==7.5.0=ha8ba4b0_17
  - protobuf==3.19.1=py38h295c915_0
  - gmp==6.2.1=h2531618_2
  - gcc_impl_linux-64==9.3.0=h70c0ae5_19
  - libwebp==1.2.2=h55f646e_0
  - freetype==2.11.0=h70c0345_0
  - libffi==3.3=he6710b0_2
  ....

Training by dot product and evaluation via inverted index?

Hey,
I recently read your SPLADEv2 paper. That's so insightful! But I still have a few questions about it.

  1. Is the model trained with the dot product as the similarity function in the contrastive loss?
  2. Is the evaluation on MS MARCO performed via an inverted index backed by Anserini?
  3. Is the evaluation on BEIR implemented with SentenceTransformers, hence also via dot product?
  4. How much can you guarantee the sparsity of the learned representation, since it is only softly regularized by the L1 and FLOPS losses (see the sketch below)? Did you use a tuned threshold to zero out near-zero values?
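
For reference, here is a minimal sketch of the FLOPS regularizer as I understand it from the papers (my paraphrase, not the repo's exact code): the sum, over vocabulary dimensions, of the squared mean activation across a batch.

import torch

def flops_loss(reps: torch.Tensor) -> torch.Tensor:
    # reps: (batch_size, vocab_size) non-negative SPLADE activations
    # sum over the vocabulary of the squared per-dimension mean activation;
    # this pushes average activations toward zero but never exactly to zero,
    # hence the question about thresholding
    return torch.sum(torch.mean(torch.abs(reps), dim=0) ** 2)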

Training SPLADE with a smaller dataset?

Hello,

Thank you for researching SPLADE & setting up the GitHub repo in such an easy-to-access way. I am trying to experiment with SPLADE and was wondering whether it is possible to train on the small MS MARCO dataset instead of the larger one. I ask because training was taking ~30 hours/epoch with the data from the distil_from_ensemble data config file.

Clustering

Maybe a stupid question, but you can't use SPLADE for clustering, right?

FLOPs calculation

I recently read your SPLADE paper and I think it's quite interesting. I have a question concerning FLOPs calculation in the paper.

I think computing FLOPs for an inverted index involves the lengths of the activated posting lists (those of the terms that overlap between the query and the document). For example, given a query "a b c" and a document "c a e", since we must inspect the posting lists of the overlapping terms a and c, the FLOPs should be at least

posting_length(a) + posting_length(c)

because we perform a summation for each entry in the posting lists. However, in the paper you compute FLOPs from the probabilities that a, b, c are activated in the query and c, a, e are activated in the document. I think this may underestimate the FLOPs of SPLADE, because the less sparse the documents, the longer the posting lists in the inverted index.
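
To make the comparison concrete, here is my reading of the paper's estimate as code (a sketch under my own assumptions, not the authors' implementation): given empirical activation probabilities per vocabulary term, the expected number of multiplications per query-document pair is the sum of their products.

import torch

def expected_flops(p_q: torch.Tensor, p_d: torch.Tensor) -> torch.Tensor:
    # p_q, p_d: (vocab_size,) fractions of queries / documents in which
    # each vocabulary term receives a non-zero weight
    return torch.sum(p_q * p_d)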

Equation (1) and (4)

In your paper, you say equation (1) is equivalent to the MLM prediction and that E_j in equation (1) denotes the BERT input embedding for token j. In the default implementation of HuggingFace Transformers, however, E_j does not come from the input layer but from another embedding matrix, called "decoder" in the "BertLMPredictionHead" (if you use BERT). Did you manually tie the "decoder" weights to the input embedding weights?

My other question concerns equation (4). It computes the sum of the weights of the document/query terms. In the "forward" function of the Splade class (models.py), however, you use the "torch.max" function. Can you explain this discrepancy?
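
For clarity, here is a sketch of the aggregation step I am asking about (my paraphrase of the forward pass; the function and variable names are my own):

import torch

def splade_agg(logits, mask, agg="max"):
    # logits: (batch, seq_len, vocab_size) MLM logits; mask: (batch, seq_len)
    w = torch.log1p(torch.relu(logits)) * mask.unsqueeze(-1)
    if agg == "max":
        return torch.max(w, dim=1).values  # what models.py appears to use
    return torch.sum(w, dim=1)             # the sum aggregation of equation (4)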

Is it possible to get a commercial license?

We are a startup attempting to sell a new search API built on some of the latest open-source text embedding and cross-encoder models. We are ourselves BSL-licensed, so I totally understand not wanting someone to commercialize your work and give nothing back.

SPLADE is obviously a very good and well-tested sparse-vector encoder and alternative to BM25, and we would like to include it in our commercial product if possible.

Is there a way to get a commercial license? We are still a tiny company without much revenue for an outright purchase, but maybe we could set up some kind of channel partnership? Not sure, but it would be amazing to figure something out if at all possible. Thanks!

Can SPLADE adapt to the Chinese language?

Hi, I am interested in your great work. I tried to train a SPLADE model based on the RoBERTa model from Hugging Face (https://huggingface.co/hfl/chinese-roberta-wwm-ext) for my retrieval task over a Chinese corpus.
But the results are not satisfactory. At inference time, my code is as follows:

# batch_embed_doc, encoder, tokenizer and max_doc_len are defined elsewhere in my code
# the queries roughly translate to: "Is Honor of Kings fun?", "I'll carry you
# to King rank", "How do I download Honor of Kings?", "How does Lu Ban use
# basic attacks?"
texts = ['王者荣耀好玩吗', '带你上王者', '如何下载王者荣耀', '鲁班怎么利用普通攻击']
embeds = batch_embed_doc(texts=texts, encoder=encoder, tokenizer=tokenizer, max_len=max_doc_len)
for i in range(len(texts)):
    print(texts[i])
    print(tokenizer.decode(embeds[i].topk(k=40).indices))

Then, I got the result:

王者荣耀好玩吗
700 喺帐 卷喉鲱 st fgo 蠹44 改判 短淇 貂 混华 oil賽 陇 谁 00 邇 呐 ssd 踝 ⒈ 2014 洞 天ᅦ 诰 or 西 乌 京 艷對 鬼 nt
带你上王者
700 呐 nt st 爸淇ᅦ 踝 git 艷鲱 dyson 貂淮44 ( 输 卷 购 53 才 葦 誣鼹 is 揶項θ 佈 cdma 贡 i3 { 马 fgo 邇 搜 以 乌帐
如何下载王者荣耀
700帐 喺喉 fgo 卷判 貂 短44鲱 st 蠹华 改 谁 00 oil淇 陇賽 混 ssd 踝 2014 or ⒈ 天ᅦ 邇 艷 射 璉 京浣 战 載對 跚 呐
鲁班怎么利用普通攻击
職 诰 据 尖閏哄my 20尔x 漏 表 才 剃 32g5s gohappymic 灞首缆 塊 互 山 种 怡 购椎 麒 奈級曇 膏 洛污 唔 find 躁

Two questions come to mind:
1. Can SPLADE adapt to the Chinese language?
2. What should I do to extend SPLADE to a Chinese corpus?

Hybrid search & normalization

Hello! I see many articles (like Pinecone's) that use the following approach to combine hybrid search results from a dense vector and SPLADE.

However, I'm a bit confused about how this works when the dense vectors are normalized to 1 but SPLADE's output is not. Any thoughts? What is the best way to conduct hybrid search with both vectors?

I understand the ANN search is done with dot product, so would we just use the highest score and not try to normalize?

def hybrid_scale(dense, sparse, alpha: float):
    # check alpha value is in range
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    # scale sparse and dense vectors to create hybrid search vecs
    hsparse = {
        'indices': sparse['indices'],
        'values':  [v * (1 - alpha) for v in sparse['values']]
    }
    hdense = [v * alpha for v in dense]
    return hdense, hsparse
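
For what it's worth, this is how I call it (the vectors are made-up examples):

# hypothetical inputs
dense = [0.1, -0.3, 0.7]
sparse = {'indices': [10, 42], 'values': [1.2, 0.8]}
hdense, hsparse = hybrid_scale(dense, sparse, alpha=0.5)
# alpha=1.0 gives pure dense search, alpha=0.0 pure sparse search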

I see this prior issue: #34, but it seemed inconclusive.

How to install the ENV correctly?

Hi,

In your readme file:

conda create -n splade_env python=3.9
conda activate splade_env
conda env create -f conda_splade_env.yml

confuses me, since conda create -n splade_env python=3.9 creates an env named splade_env without installing any packages, while conda env create -f conda_splade_env.yml directly creates an env named splade with python=3.8 and all the required packages. The following instructions then use splade_env, even though splade_env is just a python=3.9 env without any packages installed.

BTW, I installed the splade env with conda env create -f conda_splade_env.yml but got the error TypeError: main() got an unexpected keyword argument 'version_base' when I ran the config config_splade++_cocondenser_ensembledistil.

How to solve this ENV issue? Thanks.

Zhiyuan.

Evaluation on MSMARCO?

Hi, thanks for your very interesting work.

Could you share how you evaluated to get the results here?
Did you use inverted indexing, or this code?
I am trying the latter approach, but it is very slow on MS MARCO.
Thank you

When do you drop a term?

I understand that the log-saturation function and the regularization loss suppress the weights of frequent terms. But when is a term dropped (its weight set to zero)? Is it when the logit is less than or equal to zero, so that the log(1 + ReLU(·)) function outputs zero?
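
Mathematically, log(1 + ReLU(x)) is exactly zero whenever x <= 0, so no extra threshold would be needed in that case; a tiny check (my own illustration, not code from the repo):

import torch

x = torch.tensor([-2.0, 0.0, 0.5])  # example logits
w = torch.log1p(torch.relu(x))      # tensor([0.0000, 0.0000, 0.4055])
# any logit <= 0 maps exactly to weight 0, so that term is dropped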

SPLADE representations on BEIR dataset

Hi,
thank you for sharing and maintaining this repo! I would like to generate the SPLADE representations for both documents and queries for all the datasets in BEIR, similar to what is possible with the create_anserini script for the MS MARCO dataset. I would like to do it both for splade-cocondenser-ensembledistil and efficient-splade-V-large.

I tried to run the following script,

export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_splade++_cocondenser_ensembledistil"

for dataset in arguana fiqa nfcorpus quora scidocs scifact trec-covid webis-touche2020 climate-fever dbpedia-entity fever hotpotqa nq
do
    python3 -m splade.beir_eval \
        config.pretrained_no_yamlconfig=true \
        +beir.dataset=$dataset \
        +beir.dataset_path=data/beir \
        config.index_retrieve_batch_size=100
done

but I get NDCG=0.001 on the arguana dataset (I then stopped the script because I guess something is wrong). What am I doing wrong? Also, does this script save the embeddings of each dataset? If not, how can I force it to save them?

FLOPS calculation

Hello!

I find that when I run the FLOPS computation, it always returns NaN.

I see your last commit fixed "force new" and changed line 25 in transformer_evaluator.py to force_new=True,
but in inverted_index.py line 23, it seems that self.n will be 0 if force_new is True.

The FLOPS no longer return NaN after I remove force_new=True.

Am I doing something wrong here? And how should I get the correct FLOPS?

Thank you!
Allen

Normalizing SPLADE embeddings - a bad idea?

Hi!

I'm using SPLADE together with sentence-transformers/multi-qa-mpnet-base-cos-v1 SentenceTransformer to create hybrid embeddings for use in Pinecone's sparse-dense indexes.

The sparse-dense indexes can only use dot-product similarity, which is why I chose a dense model trained with cosine similarity. This means I get back dense embeddings with an L2 norm of 1 and dot-product similarities in the range [-1, 1], which I can easily rescale to the unit interval. Based on my somewhat limited understanding, this seems like a relatively sound approach to getting scores that our users can understand as % similarity (assuming in-distribution data).

After transitioning to sparse-dense vectors, I noticed that SPLADE does not produce normalized embeddings, which means this approach no longer works. I thought about normalizing the SPLADE embeddings, but I'm not sure how this would affect performance.
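
Concretely, the normalization I am considering looks like this (my own sketch, not something from the repo, and I am unsure it is a good idea):

import torch

def l2_normalize_sparse(values):
    # L2-normalize the non-zero values of a sparse SPLADE vector so that
    # dot product behaves like cosine similarity; note this rescales term
    # weights and may change rankings
    v = torch.tensor(values)
    norm = torch.linalg.norm(v)
    return (v / norm).tolist() if norm > 0 else values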

On a separate note, I'm using Pinecone's convex combination:

# alpha in range [0, 1]
embedding.sparse.values = [
    value * (1 - alpha) for value in embedding.sparse.values
]
embedding.dense = [value * alpha for value in embedding.dense]

I am struggling to reason about how all of this interacts and what effect it has on ranking. See here for info on how Pinecone's score is calculated, and here for more details about their convex-combination logic.

Any help understanding this stuff would be hugely appreciated 🙌

Cheers!

This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

This repo fails to clone, whether by running git clone manually or by installing with pipenv:

pipenv install git+https://github.com/naver/splade.git#egg=splade

Error

Cloning into '/REDACTED/splade'...
remote: Enumerating objects: 524, done.        
remote: Counting objects: 100% (57/57), done.        
remote: Compressing objects: 100% (34/34), done.        
remote: Total 524 (delta 32), reused 24 (delta 23), pack-reused 467        
Receiving objects: 100% (524/524), 3.09 MiB | 18.16 MiB/s, done.
Resolving deltas: 100% (274/274), done.
Downloading weights/distilsplade_max/pytorch_model.bin (268 MB)
Error downloading object: weights/distilsplade_max/pytorch_model.bin (33a5b0a): Smudge error: Error downloading weights/distilsplade_max/pytorch_model.bin (33a5b0a696d7b540065aedf6a86a056df3ac5f074d5be43923f0315f8b8bf7c4): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Errors logged to '/REDACTED/splade/.git/lfs/logs/20230318T195557.182673.log'.
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: weights/distilsplade_max/pytorch_model.bin: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'


Would you like to retry cloning?

Please help resolve this.

Is there an alternative way to install?

Dockerized environment to run splade

Can you please provide a Dockerfile to run splade? The conda env depends on binaries built against NVIDIA CUDA for the Linux platform. I cannot build it for non-CUDA Linux or macOS. I tried replacing those dependencies with CPU-only counterparts, but still did not manage to make it work.

A Dockerfile would allow users to run splade more easily across platforms.

Indexing a document corpus with Efficient SPLADE

What is the process for indexing MS MARCO using Efficient SPLADE?

I see a Dropbox link to download a pre-built index for MS MARCO, and a command to use PISA's query evaluation to retrieve from that index. However, I'd like to reproduce the indexing stage for this and other IR datasets.

Multilingual version of SPLADE

I'm very impressed by SPLADE, particularly the newest efficient versions. However, it is only trained on English texts.

There's an mMARCO dataset that has 14 languages and is already in use by SBERT and other projects. Importantly, there's a doc2query mT5 model that uses this dataset. It seems to me that anyone working with non-English (or multiple) languages currently has no choice but to use that. A SPLADE version would be fantastic, especially if compared to the mT5 version of doc2query on BEIR zero-shot data!

Even better would be if you could somehow use the FLORES-200 dataset, which is used by the cutting-edge NLLB-200 translation model!

Would you consider implementing a multilingual version in a future iteration of SPLADE? I think this would provide immense value to the global community!

Also, it's not clear to me whether the SPLADE++ methods were used as part of your efficient version, so it would be great if you could use and compare them with the other methods.

Inference Experiments

Hey all,

I'm looking at the Efficiency Study paper and I'd like to replicate the query encoding numbers - could you please provide a pipeline or any other pointers so I can ensure my measurement is correct?

Thanks a lot!

Instructions on Using PISA for SPLADE

Firstly, thanks for your series of amazing papers and well-organized code implementations.

The two papers "Wacky Weights in Learned Sparse Representations and the Revenge of Score-at-a-Time Query Evaluation" and "From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective" show that using PISA can make query retrieval much faster than using Anserini or the code from this repo for SPLADE.

The folder efficient_splade_pisa/ in the repo contains instructions on using PISA for SPLADE, but only for already-processed queries and indexes. If I only have a well-trained SPLADE model, how can I process its outputs (sparse vectors, or their quantized version for Anserini) to make them suitable for PISA? Can you provide more specific instructions on this?

Best wishes

Change default to splade-v3

Hey,

should we change the default configuration from splade++ to splade-v3? I could make a PR for the readme if that makes sense.

TypeError: main() got an unexpected keyword argument 'version_base'

Howdy,

Sorry to bother you, but I'm just trying to get the basic toy-data training task to work on a fresh git clone, and I'm running into the following error when running:

python3 -m splade.train

Traceback (most recent call last):
  File "/home/vagrant/.conda/envs/splade_source/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/vagrant/.conda/envs/splade_source/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/srv/repos/splade_source/splade/index.py", line 13, in <module>
    @hydra.main(config_path=CONFIG_PATH, config_name=CONFIG_NAME, version_base="1.2")
TypeError: main() got an unexpected keyword argument 'version_base'

is there something basic I'm missing?
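
(For what it's worth, my current understanding is that hydra.main() only accepts version_base from hydra-core 1.2 onward, so this TypeError suggests an older hydra-core in the env. A quick check:)

import hydra

# if this prints a version below 1.2, upgrading hydra-core should make
# the version_base argument available
print(hydra.__version__)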

configuration for splade++ results

Hi-- thanks for the nice work.

I'm trying to index+retrieve using the naver/splade-cocondenser-ensembledistil model. Following the readme, I've done:

export SPLADE_CONFIG_FULLPATH="config_default.yaml"
# init_dict.model_type_or_dir: from the readme, using the new model
# index=msmarco: added by me
python3 -m src.index \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  index=msmarco

export SPLADE_CONFIG_FULLPATH="config_default.yaml"
# init_dict.model_type_or_dir: from the readme, using the new model
# index=msmarco and retrieve_evaluate=msmarco: added by me
python3 -m src.retrieve \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  config.out_dir=experiments/pre-trained/out-dl19 \
  index=msmarco \
  retrieve_evaluate=msmarco

Everything runs just fine, but I'm getting rather poor results in the end:

MRR@10: 0.18084248646927734
recall ==> {'recall_5': 0.2665353390639923, 'recall_10': 0.3298710601719197, 'recall_15': 0.3694364851957974, 'recall_20': 0.3951050620821394, 'recall_30': 0.4270654250238777, 'recall_100': 0.5166069723018146, 'recall_200': 0.5560768863419291, 'recall_500': 0.606984240687679, 'recall_1000': 0.6402578796561604}

I suspect it's a configuration problem on my end, but since the indexing process takes a bit of time, I thought I'd ask before diving too far into the weeds: is there a configuration file to use for the splade++ results, and how do I use it?

Thanks!

Tutorial to export a SPLADE model to ONNX

Hello,

I recently trained a SPLADE model on my own. To reduce inference time, I tried to export my model to ONNX with torch.onnx.export(), but I encountered a few errors.

Is there a tutorial somewhere for this conversion?
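
For reference, this is roughly what I am attempting (a minimal sketch under my own assumptions: I wrap a Hugging Face MLM checkpoint and apply the SPLADE log(1 + ReLU(·)) + max pooling inside the graph; the checkpoint name is just an example):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_dir = "naver/splade-cocondenser-ensembledistil"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
mlm = AutoModelForMaskedLM.from_pretrained(model_dir).eval()

class SpladeWrapper(torch.nn.Module):
    def __init__(self, mlm):
        super().__init__()
        self.mlm = mlm

    def forward(self, input_ids, attention_mask):
        # MLM logits -> log-saturated ReLU -> max pooling over the sequence
        logits = self.mlm(input_ids=input_ids, attention_mask=attention_mask).logits
        w = torch.log1p(torch.relu(logits)) * attention_mask.unsqueeze(-1)
        return torch.max(w, dim=1).values

inputs = tokenizer("a big fat potato", return_tensors="pt")
torch.onnx.export(
    SpladeWrapper(mlm),
    (inputs["input_ids"], inputs["attention_mask"]),
    "splade.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["sparse_rep"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=14,
)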

Seeking Assistance with SPLADE Model for Chinese Text

Hello,

I am currently developing a SPLADE model focusing on Chinese text, and during the training process, I have encountered several issues that I hope you might be able to help me with:

I attempted to pre-train the model using the method outlined in the paper available at https://arxiv.org/pdf/2301.10444.pdf, but I am unsure whether the problem lies in my implementation details. I observed that the sparse representation becomes entirely zeros during fine-tuning. For the FLOPS input, I am using log(1 + ReLU(y_logits)), and I have also tried adding an MLM loss specifically targeting log(1 + ReLU(y_logits)).

I found that the original MLM loss + FLOPS on log(1 + ReLU(y_logits - 1)) yielded better training results than MLM loss + FLOPS on log(1 + ReLU(y_logits)).

LexMAE has shown satisfactory results on English text datasets, and I am curious whether you have conducted any experiments on top of LexMAE's foundation.

I would greatly appreciate your response and any advice you can provide.

Thank you very much for your time.

Best regards,

Inquiry about Configuration Details for "ecir23-scratch-tydi-japanese-splade" Model

Hello, I am currently developing a Japanese model and have been referencing the "ecir23-scratch-tydi-japanese-splade" model on Hugging Face for guidance. I would greatly appreciate it if you could share the specific settings, including the models and datasets used, to create this model. This information will be incredibly helpful for my project. Thank you in advance for your assistance.

URL: https://huggingface.co/naver/ecir23-scratch-tydi-japanese-splade

Running SPLADE in production (Render python server)

Hi team, I'm working with my development team on using SPLADE for sparse embeddings (alongside dense embeddings from OpenAI Ada), with the end goal of having a hybrid search setup.

However, we keep running into memory issues when creating embeddings for chunks of text.

I was wondering if you have any tips for running this in production, or better still, if there is an API you're aware of that can take text as input and output sparse embeddings?

Any help would be hugely appreciated.
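
For context, the main mitigation we have tried is encoding in small batches under torch.no_grad() and moving results off the GPU immediately. A minimal sketch (assuming model is a loaded Splade instance with the repo's d_kwargs/d_rep interface and tokenizer is its tokenizer):

import torch

def encode_batched(texts, model, tokenizer, batch_size=8, max_length=256):
    reps = []
    model.eval()
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], return_tensors="pt",
                              padding=True, truncation=True, max_length=max_length)
            # keep only the sparse representation, on CPU, to bound peak memory
            reps.append(model(d_kwargs=batch)["d_rep"].to_sparse().cpu())
    return reps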

Use interactively without indexing?

Hi there!

I'm looking to use Splade to evaluate just a handful of examples programmatically (without reading/writing to disk). The inference_splade.ipynb notebook was super useful, but I'd like to evaluate a query over a handful of documents rather than only look at the document representations.

Is there an easy way to do this, or will I have to index and write to disk my small number of examples and then retrieve from that?

Thanks!
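
In case it helps to be concrete, this is the kind of thing I mean (a sketch pieced together from the inference notebook; the checkpoint and the q_kwargs/d_kwargs interface are my assumptions):

import torch
from transformers import AutoTokenizer
from splade.models.transformer_rep import Splade

model_dir = "naver/splade-cocondenser-ensembledistil"  # example checkpoint
model = Splade(model_dir, agg="max")
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_dir)

def encode(text, is_query=False):
    # returns a dense (vocab_size,) tensor of SPLADE term weights
    with torch.no_grad():
        tokens = tokenizer(text, return_tensors="pt")
        if is_query:
            return model(q_kwargs=tokens)["q_rep"].squeeze()
        return model(d_kwargs=tokens)["d_rep"].squeeze()

docs = ["a big fat potato", "how to train a retrieval model"]
q_rep = encode("potato recipes", is_query=True)
scores = [torch.dot(q_rep, encode(d)).item() for d in docs]
print(sorted(zip(scores, docs), reverse=True))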

[Bug] Get PyTorch version

Hi, I believe there is a bug in the function that checks whether the PyTorch version is >= 1.6.

https://github.com/naver/splade/blob/main/splade/tasks/amp.py

import torch

# inspired from Colbert repo: https://github.com/stanford-futuredata/ColBERT

PyTorch_over_1_6 = float((torch.__version__.split('.')[1])) >= 6 and float((torch.__version__.split('.')[0])) >= 1

It returns False for the PyTorch version '2.0.1+cu117' (Google Colab). Could you check it, please?
I have replaced it with:

PyTorch_over_1_6 = float(".".join([torch.__version__.split('.')[0], torch.__version__.split('.')[1]])) >= 1.6


This error makes the code break when using this PyTorch version combined with fp16 = True.

Error message:
"Cannot use AMP for PyTorch version < 1.6"


Benchmark Performance After Re-ranking?

I'm curious whether you've run your model with a "second-stage" reranker on the BEIR benchmarks.
Would you expect much benefit from this?

Thank you, and excellent work!

Proposed Dockerfile

Hello maintainers and community,

I've noticed that the project doesn't currently have a Dockerfile, so I've taken the initiative to create one. Dockerizing the project provides a consistent environment for both development and deployment, making it easier for contributors to get started and maintain the quality of the project.

What I've Done:

Created a Dockerfile to build and run the project
Tested it locally to ensure that it works as expected

How to Test the Docker Setup:

Build the Docker image: docker build -t [image-name] .
Run the Docker container: docker run [options] [image-name]
Run the splade.all: /opt/conda/envs/splade/bin/python -m splade.all config.checkpoint_dir=experiments/debug/checkpoint config.index_dir=experiments/debug/index config.out_dir=experiments/debug/out

Dockerfile

FROM continuumio/anaconda3:2022.05

RUN git clone https://github.com/naver/splade.git
WORKDIR /splade

# create the "splade" conda env from the provided yml (used by the run command above);
# a separate empty splade_env is not needed
RUN conda env create -f conda_splade_env.yml

Placement of the Dockerfile:

I've placed the Dockerfile in the project root directory for now, but I'm open to suggestions if there's a more appropriate directory for it.

I would appreciate your feedback on this addition. If it aligns with the project's goals and you find it beneficial, I would be happy to submit a Pull Request.

Thank you for considering my proposal.

Quick Start Problem: an unexpected keyword argument 'version_base'

Hello. I've got a TypeError when running the quick start example and also while working with hf.train.py.

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/usr/splade/splade/all.py", line 6, in <module>
    from .index import index
  File "/home/usr/s/splade/splade/index.py", line 13, in <module>
    @hydra.main(config_path=CONFIG_PATH, config_name=CONFIG_NAME, version_base="1.2")
TypeError: main() got an unexpected keyword argument 'version_base'

As far as I understand, it isn't a crucial argument. However, removing it from @hydra.main generates SystemError: 2.

PyTorch version checking

This line

PyTorch_over_1_6 = float(".".join([torch.__version__.split('.')[0], torch.__version__.split('.')[1]])) >= 1.6

doesn't work properly, because there are versions above 1.6 that are numerically less than 1.6 when compared as floats, for example versions >= 1.10.
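
A sketch of a more robust check (my suggestion; it assumes only that torch.__version__ starts with "major.minor"): compare the parsed integer tuple instead of a float.

import torch

# handles "1.10.0" and "2.0.1+cu117" correctly, unlike the float comparison
major, minor = (int(part.split("+")[0]) for part in torch.__version__.split(".")[:2])
PyTorch_over_1_6 = (major, minor) >= (1, 6)

Alternatively, comparing packaging.version.parse(torch.__version__) against packaging.version.parse("1.6") avoids the hand-rolled parsing entirely.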
