Comments (14)
No problem :)
I am not reproducing their results, but I used their technique in https://github.com/PaulLerner/ViQuAE
I'll let you know once I compare it with other methods.
Thank you very much, Paul!
I am happy to see that max norm outperforms default-minimum.
To give you some context, I added/invented max norm because the minimum score is often unknown. We usually fuse only the top retrieved documents from each model, which makes min-max (in this specific context) not very sound to me. I have not done extensive experimentation, but in my experience max norm outperforms min-max very often.
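To make the difference concrete, here is a minimal per-query sketch of the two schemes (plain Python for illustration only, not the exact ranx implementation; I believe max norm simply divides by the per-query maximum, but check the source):

```python
def min_max_norm(results):
    # results: {doc_id: score} for one query, after the top-k cutoff.
    # The observed minimum is only the minimum among *retrieved* documents,
    # not the true minimum score of the system.
    lo, hi = min(results.values()), max(results.values())
    return {d: (s - lo) / (hi - lo) for d, s in results.items()}


def max_norm(results):
    # Divide by the per-query maximum only: no assumption is made about
    # the (often unknown) minimum score.
    hi = max(results.values())
    return {d: s / hi for d, s in results.items()}
```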
I have never used ZMUV, to be honest. I implemented it for completeness and tried it for comparison purposes, but I never got better results than min-max, max, or sum, which sometimes works best.
In general, I prefer local normalization schemes because they are "unsupervised" and can be used out of the box.
Without strong empirical evidence that default-minimum (w/ or w/o ZMUV) works better than min-max, max, or sum, I would not use it.
Also, without a standardized way of normalizing/fusing results, it is often difficult to understand what brings improvements over the state of the art. Conducting in-depth ablation studies is costly, and we often lack the space to write about them in conference papers.
Hi Paul,
Are you referring to this passage?
> Finally, there are a few more details of exactly how to combine BM25 and DPR scores worth exploring. As a baseline, we tried using the raw scores directly in the linear combination (exactly as above). However, we noticed that the range of scores from DPR and BM25 can be quite different. To potentially address this issue, we tried the following normalization technique: If a document from sparse retrieval is not in the dense retrieval results, we assign to it the minimum dense retrieval score among the retrieved documents as its dense retrieval score, and vice versa for the sparse retrieval score.
If so, as the authors say, this is a normalization technique, not a fusion method.
You can easily implement it and run it before passing the runs to ranx.
Also, you can bypass the normalization step of `fuse` and `optimize_fusion` by passing `norm=None`.
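For example (a rough sketch with placeholder runs and weights; adapt them to your setup):

```python
from ranx import fuse

# run_a, run_b: ranx Run objects whose scores have already been
# normalized externally (e.g. with the default-minimum trick applied).
combined_run = fuse(
    runs=[run_a, run_b],
    norm=None,                       # bypass ranx's internal normalization
    method="wsum",                   # weighted-sum fusion
    params={"weights": (0.5, 0.5)},  # example weights
)
```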
I did not have the time to read the entire paper, but this "normalization" method is not very sound to me.
Normalization should make relevance scores computed by different systems comparable, and this is not the case.
Is there any comparison between their method and simple min-max normalization in the paper?
> Are you referring to this passage?
Yes
> If so, as the authors say, this is a normalization technique, not a fusion method. You can easily implement it and run it before passing the runs to ranx.
Ok, then I’ll close this issue for now, thanks for the tips!
> I did not have the time to read the entire paper, but this "normalization" method is not very sound to me. Normalization should make relevance scores computed by different systems comparable, and this is not the case.
I think both techniques can be combined. The default-minimum technique helps because of the top-K cutoff.
> Is there any comparison between their method and simple min-max normalization in the paper?
No
Would you mind computing the results for wsum with that normalization method vs the min-max and max norms and posting them here? Thanks.
What do you mean? Reproducing the experiments of Ma et al.? Or on some dummy runs?
By the way, at the moment, my use case is to use the default-minimum trick of Ma et al.: when combining results from systems A and B, it consists of giving a document the minimum score among A's results if it was only retrieved by system B, and vice versa.
> I did not have the time to read the entire paper, but this "normalization" method is not very sound to me. Normalization should make relevance scores computed by different systems comparable, and this is not the case.
I think both techniques can be combined. The default-minimum technique helps because of the top-K cutoff.
P.S. Especially if you use ZMUV normalization: a document absent from A's results would effectively have a score of 0, i.e., an average score instead of a bad one.
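A toy example of what I mean (made-up scores, per-query ZMUV, i.e. subtract the mean and divide by the standard deviation):

```python
import statistics

# Hypothetical scores from system A for one query (after the top-k cutoff).
scores_a = {"d1": 10.0, "d2": 8.0, "d3": 6.0}
mean = statistics.mean(scores_a.values())
std = statistics.pstdev(scores_a.values())
zmuv_a = {d: (s - mean) / std for d, s in scores_a.items()}
# zmuv_a is roughly {'d1': 1.22, 'd2': 0.0, 'd3': -1.22}

# A document retrieved only by system B contributes 0 from A in the weighted
# sum, i.e. it is treated like an average A document ('d2'), not a bad one.
# The default-minimum trick would instead give it zmuv_a['d3'], the worst score.
```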
Sorry, I assumed you were reproducing the results from that paper...
If you try that normalization method on any non-dummy runs, could you please check whether it reaches better results than the normalization methods implemented in ranx?
So, for what it's worth, combined with a global ZMUV normalization (mean and std computed over the whole dataset instead of being query-dependent) and wsum fusion, the default-minimum technique helps to fuse DPR and CLIP (as described in https://hal.archives-ouvertes.fr/hal-03650618), on ViQuAE's dev set:
With:
Weights | MRR@100 |
---|---|
(0.0, 1.0) | 0.322 |
(0.1, 0.9) | 0.323 |
(0.2, 0.8) | 0.327 |
(0.3, 0.7) | 0.335 |
(0.4, 0.6) | 0.340 |
(0.5, 0.5) | 0.342 |
(0.6, 0.4) | 0.333 |
(0.7, 0.3) | 0.293 |
(0.8, 0.2) | 0.242 |
(0.9, 0.1) | 0.168 |
(1.0, 0.0) | 0.127 |
Without:
Weights | MRR@100 |
---|---|
(0.0, 1.0) | 0.295 |
(0.1, 0.9) | 0.313 |
(0.2, 0.8) | 0.316 |
(0.3, 0.7) | 0.315 |
(0.4, 0.6) | 0.310 |
(0.5, 0.5) | 0.299 |
(0.6, 0.4) | 0.276 |
(0.7, 0.3) | 0.259 |
(0.8, 0.2) | 0.238 |
(0.9, 0.1) | 0.215 |
(1.0, 0.0) | 0.165 |
I implemented it in pure Python, but I guess you would like it in Numba:
```python
def default_minimum(runs):
    # union results
    all_documents = {}
    for run in runs:
        for q_id, results in run.run.items():
            all_documents.setdefault(q_id, set())
            all_documents[q_id] |= results.keys()
    # set default-minimum in runs
    for run in runs:
        for q_id, results in run.run.items():
            minimum = min(results.values())
            for d_id in all_documents[q_id]:
                results.setdefault(d_id, minimum)
    return runs
```
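For example, on two toy runs (SimpleNamespace stand-ins for ranx Run objects, since only the .run attribute is needed here):

```python
from types import SimpleNamespace

# Toy stand-ins for ranx Run objects (only .run, a {query: {doc: score}} dict, is used).
run_a = SimpleNamespace(run={"q1": {"d1": 0.9, "d2": 0.4}})
run_b = SimpleNamespace(run={"q1": {"d3": 0.7}})

run_a, run_b = default_minimum([run_a, run_b])
# run_a.run["q1"] now also contains d3 with score 0.4 (A's minimum for q1);
# run_b.run["q1"] now also contains d1 and d2, both with score 0.7 (B's minimum for q1).
```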
I can convert it to Numba, don't worry.
Could you please post the results for min-max and max norms with and without that approach?
Thank you!
Mmh, that brings me to another feature request (opening another issue).
Hi, so it turns out that it depends a lot on the normalization method. Your zmuv (query-dependent) works worse than my custom ZMUV over the whole dataset (results above), but the overall best is max normalization without default-minimum. Maybe this will depend on the fusion method, though. See results below.
With default minimum
Norm: zmuv, Method: wsum. Best parameters: {'weights': (0.3, 0.7)}.
Weighted SUM
Weights | MRR@100 |
---|---|
(0.0, 1.0) | 0.322 |
(0.1, 0.9) | 0.323 |
(0.2, 0.8) | 0.323 |
(0.3, 0.7) | 0.324 |
(0.4, 0.6) | 0.322 |
(0.5, 0.5) | 0.306 |
(0.6, 0.4) | 0.248 |
(0.7, 0.3) | 0.176 |
(0.8, 0.2) | 0.148 |
(0.9, 0.1) | 0.142 |
(1.0, 0.0) | 0.127 |
Norm: min-max, Method: wsum. Best parameters: {'weights': (0.4, 0.6)}.
Weighted SUM
Weights | MRR@100 |
---|---|
(0.0, 1.0) | 0.322 |
(0.1, 0.9) | 0.323 |
(0.2, 0.8) | 0.324 |
(0.3, 0.7) | 0.328 |
(0.4, 0.6) | 0.329 |
(0.5, 0.5) | 0.269 |
(0.6, 0.4) | 0.169 |
(0.7, 0.3) | 0.153 |
(0.8, 0.2) | 0.146 |
(0.9, 0.1) | 0.141 |
(1.0, 0.0) | 0.127 |
Norm: max, Method: wsum. Best parameters: {'weights': (0.3, 0.7)}.
Weights | MRR@100 |
---|---|
(0.0, 1.0) | 0.322 |
(0.1, 0.9) | 0.329 |
(0.2, 0.8) | 0.338 |
(0.3, 0.7) | 0.339 |
(0.4, 0.6) | 0.320 |
(0.5, 0.5) | 0.280 |
(0.6, 0.4) | 0.242 |
(0.7, 0.3) | 0.191 |
(0.8, 0.2) | 0.160 |
(0.9, 0.1) | 0.146 |
(1.0, 0.0) | 0.127 |
Without default minimum
Norm: zmuv, Method: wsum. Best parameters: {'weights': (0.2, 0.8)}.
Weighted SUM
Weights | MRR@100 |
---|---|
(0.0, 1.0) | 0.322 |
(0.1, 0.9) | 0.323 |
(0.2, 0.8) | 0.323 |
(0.3, 0.7) | 0.322 |
(0.4, 0.6) | 0.318 |
(0.5, 0.5) | 0.294 |
(0.6, 0.4) | 0.251 |
(0.7, 0.3) | 0.196 |
(0.8, 0.2) | 0.159 |
(0.9, 0.1) | 0.147 |
(1.0, 0.0) | 0.135 |
Norm: min-max, Method: wsum. Best parameters: {'weights': (0.4, 0.6)}.
Weighted SUM
Weights | MRR@100 |
---|---|
(0.0, 1.0) | 0.322 |
(0.1, 0.9) | 0.323 |
(0.2, 0.8) | 0.324 |
(0.3, 0.7) | 0.328 |
(0.4, 0.6) | 0.329 |
(0.5, 0.5) | 0.313 |
(0.6, 0.4) | 0.169 |
(0.7, 0.3) | 0.154 |
(0.8, 0.2) | 0.145 |
(0.9, 0.1) | 0.141 |
(1.0, 0.0) | 0.127 |
Norm: max, Method: wsum. Best parameters: {'weights': (0.2, 0.8)}.
Weighted SUM
Weights | MRR@100 |
---|---|
(0.0, 1.0) | 0.322 |
(0.1, 0.9) | 0.351 |
(0.2, 0.8) | 0.351 |
(0.3, 0.7) | 0.350 |
(0.4, 0.6) | 0.350 |
(0.5, 0.5) | 0.327 |
(0.6, 0.4) | 0.173 |
(0.7, 0.3) | 0.172 |
(0.8, 0.2) | 0.173 |
(0.9, 0.1) | 0.172 |
(1.0, 0.0) | 0.127 |
Note that the results above depend on the models. With other models I found default-minimum to be essential to ZMUV normalization, which really makes sense to me, as I’ve said above.
What's your opinion about "global" normalization? E.g., for ZMUV, computing the mean and std over the whole dataset instead of per query?
By the way, I originally went for ZMUV + weighted sum because of https://doi.org/10.1145/502585.502657
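For reference, this is roughly what I mean by "global" ZMUV (a sketch over a ranx-style run dict, i.e. {query_id: {doc_id: score}}; mean and std are computed once over all queries instead of per query):

```python
import statistics

def global_zmuv(run_dict):
    # run_dict: {query_id: {doc_id: score}}, as in a ranx Run's .run attribute.
    all_scores = [s for results in run_dict.values() for s in results.values()]
    mean = statistics.mean(all_scores)
    std = statistics.pstdev(all_scores)
    return {
        q_id: {d_id: (s - mean) / std for d_id, s in results.items()}
        for q_id, results in run_dict.items()
    }
```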