Comments (6)
@nirmal2k the model mentioned above is a uniCOIL model (castorini/unicoil-d2q-msmarco-passage).
We don't have cls_proj for uniCOIL.
In the Hugging Face model, the tok_proj parameters are stored under the key coil_encoder.tok_proj inside pytorch_model.bin.
Here is the class we defined to load the model directly from Hugging Face:
https://github.com/castorini/pyserini/blob/88479fdecdb5d44934f4d91be56487148ec6601c/pyserini/encode/_unicoil.py#L27
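To check which projection layers a checkpoint actually contains, one option is to list the parameter names in its state dict. The snippet below is a minimal sketch, not the Pyserini implementation: `find_param_keys` is a hypothetical helper, and the toy dictionary stands in for the real checkpoint. Only the `coil_encoder.tok_proj` prefix comes from the comment above; the `.weight`/`.bias` suffixes and the `bert` sub-key are assumptions.

```python
def find_param_keys(state_dict, fragment):
    """Return the sorted parameter names in state_dict that contain fragment."""
    return sorted(key for key in state_dict if fragment in key)


# Toy stand-in for the contents of pytorch_model.bin.
# Key suffixes and the "bert" sub-key are assumptions made for illustration.
toy_state_dict = {
    "coil_encoder.bert.embeddings.word_embeddings.weight": None,
    "coil_encoder.tok_proj.weight": None,
    "coil_encoder.tok_proj.bias": None,
}

tok_keys = find_param_keys(toy_state_dict, "tok_proj")
cls_keys = find_param_keys(toy_state_dict, "cls_proj")

print("tok_proj keys:", tok_keys)
print("cls_proj keys:", cls_keys)  # empty for a uniCOIL-style checkpoint
```

With the real file, the same helper could be applied to the dictionary returned by `torch.load("pytorch_model.bin", map_location="cpu")`, assuming the checkpoint has been downloaded locally.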
from coil.
Yes, our checkpoint is released on Hugging Face: castorini/unicoil-d2q-msmarco-passage
Here is a reproduction guide for your reference (in Pyserini):
https://github.com/castorini/pyserini/blob/master/docs/experiments-unicoil.md
Great, thank you!
My goal is to use uniCOIL and COIL with a different corpus. Is there also a guide that covers how to apply them to a different text collection (e.g., how to prepare corpus input data that can be used with the run_marco.py script for text encoding)?
My current understanding is that I would need to train a model (or use the model you just shared). Then I need to apply the model to pre-process the text collection using the commands described in the Encoding subsection, convert the output of run_marco.py to Anserini's JSONL collection input format using doc_emb_to_jsonl.py, and then follow the standard Anserini commands (with the necessary adjustments) described in the doc that you linked.
Is this correct?
Yes, that is correct.
The current input file (the JSON) contains tokenized passages in the following format: "text": " [SEP] ".
In the uniCOIL work, we generate expansions with docTTTTTquery.
For your own corpus, you can either supply your own passage expansions or simply skip expansion.
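As a rough illustration of the two options (with or without expansion), the sketch below builds JSON lines with a "text" field for a custom corpus. It rests on several assumptions: the placeholders on either side of [SEP] are not visible in the format shown above, so the passage-first order used here is a guess; the "id" field and the helper names `build_text` and `to_jsonl_line` are hypothetical; and tokenization is left out entirely. Check the actual input expected by run_marco.py before relying on it.

```python
import json

SEP = " [SEP] "  # separator taken from the "text" format quoted above


def build_text(passage, expansion=None):
    """Join a passage and an optional expansion with the [SEP] marker.

    The passage-first order is an assumption, not confirmed by the thread.
    """
    if expansion:
        return passage + SEP + expansion
    return passage


def to_jsonl_line(doc_id, passage, expansion=None):
    """Serialize one document as a JSON line ("id" is a hypothetical field)."""
    record = {"id": doc_id, "text": build_text(passage, expansion)}
    return json.dumps(record, ensure_ascii=False)


# Without expansion: the passage is used as-is.
line_plain = to_jsonl_line("d1", "coil is a retrieval model")
# With expansion: e.g. queries generated by a doc2query-style model.
line_expanded = to_jsonl_line("d2", "coil is a retrieval model", "what is coil")

print(line_plain)
print(line_expanded)
```

Keeping the expansion optional means the same script covers both cases mentioned above: a corpus with generated expansions and a corpus encoded directly.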
Thanks very much!
I think I'm missing something. May I know where to find the tok_proj and cls_proj models? Are they part of the model on Hugging Face?