Comments (10)
For Chinese we have a few models that work (https://huggingface.co/naver/neuclir22-splade-zh and https://huggingface.co/naver/neuclir22-pretrained-zh), but they are mostly trained from scratch. Unfortunately there are some problems using RoBERTa for SPLADE (see Figure 2 of https://user.eng.umd.edu/~oard/pdf/desires22.pdf). We explain a bit about how we trained these models in https://arxiv.org/pdf/2303.11171.pdf and https://arxiv.org/pdf/2301.10444.pdf.
Hope this helps. Let me know if you have more questions.
from splade.
Without any update, I'm closing this. Feel free to reopen if needed.
I tried this model ([/neuclir22-splade-zh]) and found that when the input text is long, it gives me this error:
indexSelectLargeIndex: block: xxx, thread: xxx Assertion srcIndex < srcSelectDimSize failed
I ran into this before. I don't know why, but changing the input length from 512 to 511 fixed the error.
Where do I change that, my friend? Any guidance would be appreciated.
Just do a global search for 512 across all your code files and replace it with 511.
I only see the parameter "max_position_embeddings": 514 in this file: https://huggingface.co/naver/neuclir22-splade-zh/blob/main/config.json. I could not find any parameter related to 512.
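One possible way to relate the 514 in config.json to a 512-token limit is sketched below. It assumes the model numbers positions the way RoBERTa-style models do (padding index 1, real tokens numbered from 2 upward); the thread does not confirm this, and the names PAD_ID, roberta_style_position_ids, and fits are illustrative only.

```python
from typing import List

PAD_ID = 1  # assumption: RoBERTa-style padding index, not confirmed in the thread
MAX_POSITION_EMBEDDINGS = 514  # value quoted from config.json above


def roberta_style_position_ids(input_ids: List[int], pad_id: int = PAD_ID) -> List[int]:
    """Number non-padding tokens pad_id + 1, pad_id + 2, ...; padding keeps pad_id."""
    positions = []
    count = 0
    for token_id in input_ids:
        if token_id == pad_id:
            positions.append(pad_id)
        else:
            count += 1
            positions.append(count + pad_id)
    return positions


def fits(num_tokens: int) -> bool:
    """True if every position id is a valid row of the position embedding table."""
    ids = [5] * num_tokens  # 5 is an arbitrary non-padding token id
    highest = max(roberta_style_position_ids(ids), default=PAD_ID)
    return highest < MAX_POSITION_EMBEDDINGS


if __name__ == "__main__":
    for n in (511, 512, 513, 514):
        print(n, fits(n))
```

Under that assumption, two of the 514 rows are never used by real tokens, so the longest input that stays inside the table is 512 tokens including special tokens.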
I would suggest removing the token_type_ids (by adding something like return_token_type_ids=False to the tokenization call). We have had problems with that before.
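A minimal sketch of that suggestion, assuming the model is loaded through the Hugging Face transformers library. The helper drop_token_type_ids and the function demo are hypothetical names; demo needs transformers, torch, and network access, so it is defined but not called here.

```python
from typing import Any, Dict


def drop_token_type_ids(encoded: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a tokenizer output without the token_type_ids entry."""
    return {key: value for key, value in encoded.items() if key != "token_type_ids"}


def demo() -> None:
    # Requires `pip install transformers torch` and network access.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("naver/neuclir22-splade-zh")
    # Ask the tokenizer not to produce token_type_ids in the first place.
    encoded = tokenizer(
        "一段中文文本",
        return_token_type_ids=False,
        return_tensors="pt",
    )
    print(sorted(encoded.keys()))
```

If the tokenization call sits inside code you cannot easily change, filtering the encoded dictionary with drop_token_type_ids before passing it to the model should have a similar effect.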
Thank you for your reply. I tried the setting 'return_token_type_ids=False', but it gave me another error when the input is long:
RuntimeError: The expanded size of the tensor (xxx) must match the existing size (514) at non-singleton dimension 1. Target sizes: [1, xxx]. Tensor sizes: [1, 514]
I finally solved this problem by passing 'max_length=514' to the tokenization.
Oh great. I would recommend limiting it to 512 though, since that would be more consistent with the training.
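Putting the thread's advice together, a tokenization call might look like the sketch below. It assumes a Hugging Face transformers tokenizer; encode_truncated, MAX_TOKENS, and demo are hypothetical names, and demo is left uncalled because it needs transformers, torch, and network access.

```python
from typing import Any

MAX_TOKENS = 512  # limit recommended above as consistent with training


def encode_truncated(tokenizer: Any, text: str, max_length: int = MAX_TOKENS) -> Any:
    """Tokenize text, cutting it off at max_length tokens and omitting token_type_ids."""
    return tokenizer(
        text,
        truncation=True,              # truncate explicitly rather than relying on defaults
        max_length=max_length,
        return_token_type_ids=False,  # suggested earlier in the thread
        return_tensors="pt",
    )


def demo() -> None:
    # Requires `pip install transformers torch` and network access.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("naver/neuclir22-splade-zh")
    batch = encode_truncated(tokenizer, "很长的中文文本" * 1000)
    print(batch["input_ids"].shape)
```

Anything beyond the limit is simply discarded by truncation, so very long documents would need to be split into chunks before encoding if the tail matters.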