amenra / ranx

⚡️ A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

Home Page: https://amenra.github.io/ranx

License: MIT License

Python 88.20% Jupyter Notebook 11.30% Makefile 0.50%
ranking-metrics numba python evaluation evaluation-metrics information-retrieval recommender-systems information-retrieval-evaluation information-retrieval-metrics data-fusion

ranx's Introduction


🔥 News

  • [August 3 2023] ranx 0.3.16 is out!
    This release adds support for importing Qrels and Runs from parquet files, exporting them as pandas.DataFrame, and saving them as parquet files. All dependencies on trec_eval have been removed to make ranx truly MIT-compliant.
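
For example, the new import/export hooks can be used roughly as follows. This is a minimal sketch: the method names from_parquet and to_parquet are assumed from the release note above (to_dataframe already exists), so check the documentation for the exact signatures.

from ranx import Qrels

# Load relevance judgments from a parquet file (added in 0.3.16)
qrels = Qrels.from_parquet("qrels.parquet")

# Export as a pandas.DataFrame or save back to parquet
df = qrels.to_dataframe()
qrels.to_parquet("qrels_copy.parquet")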

⚡️ Introduction

ranx ([raŋks]) is a library of fast ranking evaluation metrics implemented in Python, leveraging Numba for high-speed vector operations and automatic parallelization. It offers a user-friendly interface to evaluate and compare Information Retrieval and Recommender Systems. ranx allows you to perform statistical tests and export LaTeX tables for your scientific publications. Moreover, ranx provides several fusion algorithms and normalization strategies, as well as automatic fusion optimization functionality. ranx also has a companion repository of pre-computed runs, called ranxhub, which facilitates model comparisons. On ranxhub, you can download and share pre-computed runs for Information Retrieval datasets, such as MSMARCO Passage Ranking. ranx was featured in ECIR 2022, CIKM 2022, and SIGIR 2023.

If you use ranx to evaluate results or to conduct experiments involving fusion for your scientific publication, please consider citing it: evaluation bibtex, fusion bibtex, ranxhub bibtex.

NB: ranx is not suited for evaluating classifiers. Please, refer to the FAQ for further details.

For a quick overview, follow the Usage section.

For an in-depth overview, follow the Examples section.

✨ Features

Metrics

The metrics have been tested against TREC Eval for correctness.

Statistical Tests

Please, refer to Smucker et al., Carterette, and Fuhr for additional information on statistical tests for Information Retrieval.
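
For example, the test used by compare can be selected through its stat_test argument. A minimal sketch, assuming the qrels and runs defined in the Usage section below; "fisher" (Fisher's randomization test) and "student" (two-sided paired Student's t-test) are the values mentioned in the documentation and in the issues further down this page:

from ranx import compare

report = compare(
    qrels=qrels,
    runs=[run_1, run_2],
    metrics=["ndcg@10"],
    stat_test="student",  # or "fisher"
    max_p=0.01,
)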

Off-the-shelf Qrels

You can load qrels from ir-datasets as simply as:

qrels = Qrels.from_ir_datasets("msmarco-document/dev")

A full list of the available qrels is provided here.

Off-the-shelf Runs

You can load runs from ranxhub as simply as:

run = Run.from_ranxhub("run-id")

A full list of the available runs is provided here.

Fusion Algorithms

CombMIN, CombMED, CombANZ, CombMAX, CombSUM, CombMNZ, CombGMNZ, ISR, Log_ISR, LogN_ISR, RRF, RBC, WMNZ, Mixed, BayesFuse, MAPFuse, PosFuse, ProbFuse, SegFuse, SlideFuse, BordaFuse, Weighted BordaFuse, Condorcet, Weighted Condorcet, Weighted Sum

Please, refer to the documentation for further details.

Normalization Strategies

Please, refer to the documentation for further details.

🔌 Requirements

python>=3.8

As of v.0.3.5, ranx requires python>=3.8.

💾 Installation

pip install ranx

💡 Usage

Create Qrels and Run

from ranx import Qrels, Run

qrels_dict = { "q_1": { "d_12": 5, "d_25": 3 },
               "q_2": { "d_11": 6, "d_22": 1 } }

run_dict = { "q_1": { "d_12": 0.9, "d_23": 0.8, "d_25": 0.7,
                      "d_36": 0.6, "d_32": 0.5, "d_35": 0.4  },
             "q_2": { "d_12": 0.9, "d_11": 0.8, "d_25": 0.7,
                      "d_36": 0.6, "d_22": 0.5, "d_35": 0.4  } }

qrels = Qrels(qrels_dict)
run = Run(run_dict)

Evaluate

from ranx import evaluate

# Compute score for a single metric
evaluate(qrels, run, "ndcg@5")
>>> 0.7861

# Compute scores for multiple metrics at once
evaluate(qrels, run, ["map@5", "mrr"])
>>> {"map@5": 0.6416, "mrr": 0.75}

Compare

from ranx import compare

# Compare different runs and perform Two-sided Paired Student's t-Test
report = compare(
    qrels=qrels,
    runs=[run_1, run_2, run_3, run_4, run_5],
    metrics=["map@100", "mrr@100", "ndcg@10"],
    max_p=0.01  # P-value threshold
)

Output:

print(report)
#    Model    MAP@100    MRR@100    NDCG@10
---  -------  --------   --------   ---------
a    model_1  0.320ᵇ     0.320ᵇ     0.368ᵇᶜ
b    model_2  0.233      0.234      0.239
c    model_3  0.308ᵇ     0.309ᵇ     0.330ᵇ
d    model_4  0.366ᵃᵇᶜ   0.367ᵃᵇᶜ   0.408ᵃᵇᶜ
e    model_5  0.405ᵃᵇᶜᵈ  0.406ᵃᵇᶜᵈ  0.451ᵃᵇᶜᵈ
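
The comparison can also be exported as a LaTeX table for inclusion in a paper. A minimal sketch; the to_latex method name is assumed here, so check the Report documentation for the exact API:

# Export the comparison table as LaTeX (method name assumed; see the docs)
print(report.to_latex())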

Fusion

from ranx import fuse, optimize_fusion

best_params = optimize_fusion(
    qrels=train_qrels,
    runs=[train_run_1, train_run_2, train_run_3],
    norm="min-max",     # The norm. to apply before fusion
    method="wsum",      # The fusion algorithm to use (Weighted Sum)
    metric="ndcg@100",  # The metric to maximize
)

combined_test_run = fuse(
    runs=[test_run_1, test_run_2, test_run_3],  
    norm="min-max",       
    method="wsum",        
    params=best_params,
)

📖 Examples

Name (each example opens in Colab):
  • Overview
  • Qrels and Run
  • Evaluation
  • Comparison and Report
  • Fusion
  • Plot
  • Share your runs with ranxhub

📚 Documentation

Browse the documentation for more details and examples.

🎓 Citation

If you use ranx to evaluate results for your scientific publication, please consider citing our ECIR 2022 paper:

BibTeX
@inproceedings{ranx,
  author       = {Elias Bassani},
  title        = {ranx: {A} Blazing-Fast Python Library for Ranking Evaluation and Comparison},
  booktitle    = {{ECIR} {(2)}},
  series       = {Lecture Notes in Computer Science},
  volume       = {13186},
  pages        = {259--264},
  publisher    = {Springer},
  year         = {2022},
  doi          = {10.1007/978-3-030-99739-7\_30}
}

If you use the fusion functionalities provided by ranx for conducting the experiments of your scientific publication, please consider citing our CIKM 2022 paper:

BibTeX
@inproceedings{ranx.fuse,
  author    = {Elias Bassani and
              Luca Romelli},
  title     = {ranx.fuse: {A} Python Library for Metasearch},
  booktitle = {{CIKM}},
  pages     = {4808--4812},
  publisher = {{ACM}},
  year      = {2022},
  doi       = {10.1145/3511808.3557207}
}

If you use pre-computed runs from ranxhub to make comparison for your scientific publication, please consider citing our SIGIR 2023 paper:

BibTeX
@inproceedings{ranxhub,
  author       = {Elias Bassani},
  title        = {ranxhub: An Online Repository for Information Retrieval Runs},
  booktitle    = {{SIGIR}},
  pages        = {3210--3214},
  publisher    = {{ACM}},
  year         = {2023},
  doi          = {10.1145/3539618.3591823}
}

🎁 Feature Requests

Would you like to see other features implemented? Please, open a feature request.

🤘 Want to contribute?

Would you like to contribute? Please, drop me an e-mail.

📄 License

ranx is an open-sourced software licensed under the MIT license.

ranx's People

Contributors

amenra, diegoceccarelli, maximedb, wojciechkusa


ranx's Issues

feature request: hits (or accuracy?)

Hi,

@osf9018 mentioned it in #2 but I guess it’s better to create a specific issue.

Motivation

It is often difficult to estimate the total number of relevant documents for a query.
For example, in Question Answering, if you have a large enough Knowledge Base, you can find the answer to your question in a surprisingly large number of documents that one cannot annotate in advance. Because of this, the relevance of a document is often estimated on the go, by checking whether the answer string is in the document retrieved by the system.

Because of this, recall is not an appropriate metric. However, one way to circumvent this is to compute recall "as if" there were only a single relevant document. After averaging over the whole dataset, it corresponds to the proportion of questions for which the system retrieved at least one relevant document in the top-K. This is what @osf9018 and I call "hits@K" (I can't remember where, but I've seen it in a paper) and others, such as Karpukhin et al., call "accuracy". Accuracy is a confusing term IMO.

The request

Would you be interested in implementing or integrating this feature in your library?
It might take some renaming, but it could be implemented very easily by using the _hits function: it is simply min(1, _hits(qrels, run, k)).
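
For reference, a minimal, library-agnostic sketch of the proposed metric on plain qrels/run dictionaries (the helper name is hypothetical, not part of ranx):

def hits_at_k(qrels: dict, run: dict, k: int) -> float:
    """Fraction of queries with at least one relevant document in the top-k results."""
    n_hits = 0
    for q_id, results in run.items():
        relevant = {doc for doc, rel in qrels.get(q_id, {}).items() if rel > 0}
        # Sort retrieved documents by descending score and keep the top-k
        top_k = sorted(results, key=results.get, reverse=True)[:k]
        n_hits += any(doc in relevant for doc in top_k)
    return n_hits / len(run)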

[Feature Request] Add interpolated recall-precision plot function

Is your feature request related to a problem? Please describe.
First of all: this is a really nice library! It helps a lot!
My request is about a recall-precision graphic. TREC-related papers very often use the interpolated precision-recall plot to visualize the performance of the IR systems being compared. They also use these graphs to understand which IR system yields higher recall, which shows higher precision, and which generally performs better.

Describe the solution you'd like
Since I love using this library, it would be great if there were a function for generating such a precision-recall plot, or just an example/notebook in the documentation showing how to generate it with ranx.

Describe alternatives you've considered
I've already considered creating these plots myself with the functions that the library offers. However, it seems quite complex as the interpolation greatly complicates this work for me.

Additional context
Here is an example interpolated recall-precision plot (image omitted).

More information can be found on:
https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
and
https://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf (p.4)
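
In the meantime, the standard 11-point interpolated precision can be computed directly from per-rank precision and recall values. A minimal numpy sketch for a single ranked list (not part of ranx):

import numpy as np

def interpolated_precision(relevances, n_relevant, recall_levels=np.linspace(0, 1, 11)):
    """11-point interpolated precision for one ranked list of 0/1 relevance labels."""
    relevances = np.asarray(relevances, dtype=float)
    ranks = np.arange(1, len(relevances) + 1)
    precision = np.cumsum(relevances) / ranks
    recall = np.cumsum(relevances) / n_relevant
    # Interpolated precision at recall level r: max precision at any point with recall >= r
    return np.array([
        precision[recall >= r].max() if np.any(recall >= r) else 0.0
        for r in recall_levels
    ])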

[BUG] Precision calculation incorrect?

Describe the bug
In the example below, I would expect run1 to have a precision of 1.0, and I would expect both run2 and run3 to have precisions of 0.75, as 3 out of 4 returned documents are relevant. Instead, the second run returns 0.5 and the third 0.25. Either there is a bug in handling empty query results, or I have a naive misunderstanding of precision. Also, run 2 and run 3 are similar, just with different queries returning null results. Please correct me if I'm wrong!

To Reproduce
Steps to reproduce the behavior:

from ranx import Qrels, Run, evaluate

qrels_dict = {
    "q_1": {"doc_a": 1},
    "q_2": {"doc_b": 1, "doc_c": 1, "doc_d": 1},
    "q_3": {"doc_e": 1},
    "q_4": {"doc_f": 1},
}

run_dict_1 = {
    "q_1": {"doc_a": 1.0},
    "q_2": {"doc_d": 1.0},
    "q_3": {"doc_e": 1.0},
    "q_4": {"doc_f": 1.0},
}

run_dict_2 = {
    "q_1": {"doc_a": 1.0},
    "q_2": {"doc_d": 1.0},
    "q_3": {},
    "q_4": {"doc_f": 1.0},
}

run_dict_3 = {
    "q_1": {"doc_a": 1.0},
    "q_2": {},
    "q_3": {"doc_e": 1.0},
    "q_4": {"doc_f": 1.0},
}


qrels = Qrels(qrels_dict)
run1 = Run(run_dict_1)
run2 = Run(run_dict_2)
run3 = Run(run_dict_3)

print(evaluate(qrels, run1, ["precision"]))
print(evaluate(qrels, run2, ["precision"]))
print(evaluate(qrels, run3, ["precision"]))

1.0
0.5
0.25

ValueError: max() arg is an empty sequence

Hello,
I'd like to determine what query is causing the following error and how to get around it:

Traceback (most recent call last):
  File "main.py", line 43, in perform_tasks
    eval(params)
  File "main.py", line 25, in eval
    eval_helper.perform_eval()
  File "/home/celso/projects/XMTC-Baselines/source/helper/EvalHelper.py", line 62, in perform_eval
    qrels = Qrels(filtered_relevance_map)
  File "/home/celso/projects/venvs/XMTC-Baselines/lib/python3.8/site-packages/ranx/data_structures/qrels.py", line 62, in __init__
    max_len = max(len(y) for x in doc_ids for y in x)
ValueError: max() arg is an empty sequence

My evaluation code is shown in the code snippet below.

ranking = self._retrieve(...)
filtered_relevance_map= {key: value for key, value in self.relevance_map.items() if key in ranking.keys()}
qrels = Qrels(filtered_relevance_map)
run = Run(ranking, name=cls)
result = evaluate(qrels, run, self.metrics, threads=12)

Question on rank aggregation usage

Thanks for your amazing work. I am very interested in this framework and am trying to use it to solve my rank aggregation problems. However, I am a little confused about the usage.

For example, I have scores for several items under different ranking rules, as follows:

item | rank1 | rank2 | rank3
item1 | 0.4 | 0.8 | 0.2
item2 | 0.8 | 0.7 | 0.7
item3 | 0.7 | 0.7 | 1.0
item4 | 0.5 | 0.6 | 0.7

Now I want a comprehensive ranking that satisfies all the sub-rankings (e.g., rank1, rank2, rank3) as much as possible, for example:

item | rank
item1 | 0.5
item2 | 0.9
item3 | 0.7
item4 | 0.4

I'm not sure whether this can be addressed with ranx. If so, could you show me an example?
Thanks a lot.
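
One way to frame this with ranx is to treat each ranking rule as a Run over a single dummy query and fuse them. A minimal sketch under that assumption; the "rrf" method string is a guess at the alias for Reciprocal Rank Fusion, so check the fusion documentation for the exact names:

from ranx import Run, fuse

scores = {
    "rank1": {"item1": 0.4, "item2": 0.8, "item3": 0.7, "item4": 0.5},
    "rank2": {"item1": 0.8, "item2": 0.7, "item3": 0.7, "item4": 0.6},
    "rank3": {"item1": 0.2, "item2": 0.7, "item3": 1.0, "item4": 0.7},
}

# One Run per ranking rule, all over the same dummy query "q_1"
runs = [Run({"q_1": item_scores}, name=rule) for rule, item_scores in scores.items()]

combined = fuse(
    runs=runs,
    norm="min-max",
    method="rrf",  # method alias assumed; see the fusion docs
)
print(combined.run["q_1"])  # fused score per item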

Why does Recall@k divide by len(relevant) rather than min(len(relevant), k)?

The question about Recall@k arose when I looked at the best R@1 scores for the Stanford Online Products dataset on paperswithcode: https://paperswithcode.com/sota/metric-learning-on-stanford-online-products-1. This benchmark uses the R@1 metric to rank the best models and approaches for the retrieval task on the SOP dataset. SOP has about 4.3 images per class (query), so the maximum R@1 score with the ranx formula would be 1 / 4.3.

In the SOP benchmark and many others, the divisor used is min(len(relevant), k).

What do you think about overriding this coefficient? And why does paperswithcode report R@1 when it is actually not R@1 but HitRate@1?

[BUG] Misleading exception message on dataframe types

Describe the bug
I'm using the library for the first time with a Pandas dataframe and ran into an exception that was misleading.

To Reproduce
Steps to reproduce the behavior:

  1. Create a dataframe where the id column is of type int64 e.g. df['id'] = df.index + 1
  2. Create the qrel like this:
qrels = Qrels.from_df(
    df=df,
    q_id_col="id",
    doc_id_col="best_document",
    score_col="score",
)
  3. Observe this error:
[/usr/local/lib/python3.10/dist-packages/ranx/data_structures/qrels.py](https://localhost:8080/#) in from_df(df, q_id_col, doc_id_col, score_col)
    293         """
    294         assert (
--> 295             df[q_id_col].dtype == "O"
    296         ), "DataFrame scores column dtype must be `object` (string)"
    297         assert (

AssertionError: DataFrame scores column dtype must be `object` (string)

Expected behavior
The assertion message should point to the correct column, in this case, it is the ID column that is of the wrong type. From inspecting the code, the assertion message is wrong when the document ID column is of the wrong type as well.

[Feature Request] Run.from_df and Run.from_parquet does not allow specifying run name

Is your feature request related to a problem? Please describe.
I'm using Pandas dataframes, and comparing different embedding models. I'd like to be able to name my runs so when I compare them, the report shows something other than run_1, run_2.

Describe the solution you'd like
Allow name as a named parameter in Run.from_df

Describe alternatives you've considered
You can just set the name afterward, e.g.

my_run.name = "My Run"

Additional context
I guess this is just for consistency, since Run.from_file allows you to pass a name.

[Feature Request] optimize norm and method in optimize_fusion

Is your feature request related to a problem? Please describe.
Instead of manually trying every possible fusion technique, it'd be nice to be able to grid-search all of them, as optimize_fusion already does for fusion hyperparameters (e.g. the weights in wsum).

Describe the solution you'd like
Allow passing List[str] instead of str for the norm and method parameters.
Then do something like:

for norm in norms:
    for method in methods:
        optimize_fusion_current_behavior(*args, norm=norm, method=method, **kwargs)

The trick is to report the results in a readable manner.

Why is ranx so slow in this simple example?

from ranx import Qrels
from ranx import Run
from ranx import evaluate

qrels_dict = {
    "text_1": {
        "label_1": 1
    },
    
    "text_2":{
        "label_2": 1,
    }
}

qrels = Qrels(qrels_dict, name="testing")



run_dict = {
    "text_1": {
        "label_1": 1,
        "label_2": 0.9,
        "label_3": 0.8,
        "label_4": 0.7,
        "label_5": 0.6,
        "label_6": 0.5,
        "label_7": 0.4,
        "label_8": 0.3,
        "label_9": 0.2,
        "label_10": 0.1,
    },
    
    "text_2": {
        "label_1": 0.9,
        "label_2": 1,
        "label_3": 0.8,
        "label_4": 0.7,
        "label_5": 0.6,
        "label_6": 0.5,
        "label_7": 0.4,
        "label_8": 0.3,
        "label_9": 0.2,
        "label_10": 0.1,
    },
}

run = Run(run_dict, name="bm25")

evaluate(qrels, run, ["mrr@1", "mrr@5", "mrr@10"])
CPU times: user 55.2 s, sys: 467 ms, total: 55.7 s
Wall time: 56.8 s

This behavior can be reproduced in this Google Colab Notebook 1 and also in this Google Colab Notebook 2 (time spent per step).

[Feature Request] relevance_level parameter

I was wondering whether ranx has a feature similar to trec_eval's relevance level, which can be specified there with the -l parameter. If not, it would be a useful feature for evaluation.

Problem with r-precision

Hi,

I tested your code and found that it was easy to use and integrate. Moreover, the results I got are fully coherent with those I previously obtained with a personal implementation of trec_eval, and the computation of the measures is fast. This is clearly an interesting piece of software, and its presentation at the demo session of ECIR 2022 is a good thing.

I had only one problem, with the R-precision measure. The main issue is that if you replace "ndcg@5" with "r-precision" in the 4th cell of the overview.ipynb notebook, you get:


TypeError Traceback (most recent call last)
/tmp/ipykernel_28676/2318072837.py in
1 # Compute NDCG@5
----> 2 evaluate(qrels, run, "r-precision")

/vol/data/ferret/tools-distrib/_research_code/rank_eval/rank_eval/meta_functions.py in evaluate(qrels, run, metrics, return_mean, threads, save_results_in_run)
149 for m, scores in metric_scores_dict.items():
150 for i, q_id in enumerate(run.get_query_ids()):
--> 151 run.scores[m][q_id] = scores[i]
152 # Prepare output -----------------------------------------------------------
153 if return_mean:

TypeError: 'numpy.float64' object does not support item assignment

I first detected the problem through the integration of your code and obtained the same error. By looking at the file meta_functions.py where the problem arises:

143 if type(run) == Run and save_results_in_run:
144 for m, scores in metric_scores_dict.items():
145 if m not in ["r_precision", "r-precision"]:
146 run.mean_scores[m] = np.mean(scores)
147 else:
148 run.scores[m] = np.mean(scores)
149 for m, scores in metric_scores_dict.items():
150 for i, q_id in enumerate(run.get_query_ids()):
151 run.scores[m][q_id] = scores[i]

I saw your recent update of this part of the code, but there is still a problem since, for R-precision, the mean of the scores is stored in run.scores and not in run.mean_scores. As a consequence, using run.scores to store the score of each query raises a problem if both the return_mean and save_results_in_run flags are set to True. More globally, I am not sure I understand why you treat R-precision differently from the other measures when computing the mean score.

Thank you in advance for your efforts to fix this issue.

Olivier

Problems with MAP

I understood that, when evaluating MAP@k, relevance judgment scores equal to 0 are ignored.
In my case, I get a bit of a weird behaviour.

I'm working on a balanced dataset with binary relevance and define qrels by including both documents labeled 1 and documents labeled 0.
While ndcg@10 gives me results at about 0.7, MAP@10 is extremely low, at about 0.10.

Can this be because, beyond the very first documents, the model performs poorly, or am I doing something wrong when evaluating?

qrels = Qrels.from_df(
    df=test_loaded_pdf,
    q_id_col="user_id",
    doc_id_col="run_session_id",
    score_col="target_binary",
)

run = Run.from_df(
    df=test_loaded_pdf,
    q_id_col="user_id",
    doc_id_col="run_session_id",
    score_col="predictions",
)

evaluate(qrels, run, ["map@10", "mrr", "ndcg@10"])

predictions in test_loaded_pdf is not a list of binary relevance labels but a float relevance score.

Getting "Segmentation fault (core dumped)" error

Hello,
Thank you for your amazing work. I am trying to use supervised fusion methods like this:

best_params = optimize_fusion(
    qrels=qrels,
    runs=[run_4, run_5],
    norm="min-max",      # Default value
    method="posfuse",
    metric="mrr@10",
)

combined_run = fuse(
    runs=[run_4, run_5],
    norm="min-max",      # Default value
    method="posfuse",
    params=best_params,
)

I have tried changing the norm and metric to every possible value, but I still get a "Segmentation fault (core dumped)" error.
I could not find any hints about this in the documentation. Sorry if I am missing something, but can you help me with using these fusion methods?
Thanks,

[Feature Request] memory issue / make Run more efficient

Hi Elias,

Is your feature request related to a problem? Please describe.
I've noticed that Run (and I guess also Qrels) consumes a lot of memory (RAM) compared to a standard Python dict, e.g. a few GB instead of a few hundred MB. This gets problematic for somewhat large datasets (e.g. 1M queries).

Describe the solution you'd like
I guess it's related to the Numba representation? I have no clue how to make it more efficient, sorry :)

Reproduce
Just open your system monitor and see how the memory grows.

In [1]: import ranx
# this weighs only a few 100s of MB
In [2]: run_d = {str(i): {str(j): 0.0 for j in range(100)} for i in range(100000)}
# this grows to a few GB
In [3]: run_r = ranx.Run(run_d)

Best,

Paul

[Feature Request] Support gzipped files?

Is your feature request related to a problem? Please describe.
trec files can be several megabytes; for example, the run_*.trec files used for the examples are all more than 20 MB, but once compressed they become less than 10 MB. That would make downloads faster and also speed up loading the files into memory.

Describe the solution you'd like
support *.trec.gz.

Describe alternatives you've considered
It would also be cool to evaluate alternative formats for storing the trec file, like parquet (https://arrow.apache.org/docs/python/parquet.html). This library focuses on computing metrics fast, but if you spend ages loading/parsing the trec file it is not very useful; parquet is much faster to load into memory and supports compression natively.

Additional context
💨
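
As a stop-gap, a compressed run can be decompressed to a temporary file and then loaded with Run.from_file. A minimal sketch; the format keyword name (kind="trec") is an assumption, so check the from_file signature:

import gzip
import shutil
import tempfile

from ranx import Run

def run_from_trec_gz(path: str) -> Run:
    """Decompress a .trec.gz file and load it as a ranx Run."""
    with gzip.open(path, "rb") as src, tempfile.NamedTemporaryFile(
        suffix=".trec", delete=False
    ) as dst:
        shutil.copyfileobj(src, dst)
    return Run.from_file(dst.name, kind="trec")  # format keyword assumed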

Qrels and Run query ids do not match

Describe the bug
I am evaluating my test set against my algorithm's recommendations, but it gives the error
Qrels and Run query ids do not match

To Reproduce
Steps to reproduce the behavior:
metrics = ["recall@10", "mrr@10","ndcg@10"]
person_date_indexs = df_recs_train_top10['person_date_index'].unique()
run = tranform_recs_to_ranx(df_recs_train_top10, person_date_indexs, "person_date_index", "gym_id", "rank")
qrels = tranform_test_to_ranx(df_test_checkins_new_col_renamed, df_recs_train_top10)
evaluate(qrels, run, metrics)

Expected behavior
print the recall, mrr and ndcg


Desktop (please complete the following information):
Ubuntu 22.04

Additional context
It worked fine with another test set. The only difference is that in this test set I removed items that the user had already interacted with in train.

[BUG] `MRR@1` is not equal `Recall@1`

Describe the bug
MRR@1 should be equal to Recall@1. However, these metrics diverge for the case below.

To Reproduce

%%capture
!pip install ranx

from ranx import Qrels, Run, evaluate
import pickle

# download files from https://drive.google.com/drive/folders/1ZLyB6mKKiQsypw36nhdZ4dGqmFw27K-3?usp=sharing
with open("qrels.pkl", "rb") as f:
    qrels = pickle.load(f)
with open("run.pkl", "rb") as f:
    run = pickle.load(f)

evaluate(
    Qrels(qrels),
    Run(run),
    ['mrr@1', 'mrr@5', 'mrr@10', 'recall@1', 'recall@5', 'recall@10'])


# {'mrr@1': 0.8133879123525163,
#  'mrr@5': 0.820242395055783,
#  'mrr@10': 0.8206332007078454,
#  'recall@1': 0.04814167526511499,
#  'recall@5': 0.05089848464127321,
#  'recall@10': 0.05171913427724859}

or use Google Colab.

Expected behavior
mrr@1=recall@1

Am I missing something?

Spearman, Kendall correlation functions

Hi! This lib is an extremely good tool to have in one's arsenal, but I think it would be nice to have Spearman and Kendall correlation functions included in this package for evaluating rankings. Maybe not the most popular metrics, but sometimes they can come in handy.
Best regards,
Ivan Savchuk

[Feature Request] infer run/qrels format from file extension in `from_file`

Is your feature request related to a problem? Please describe.
I think it's quite frustrating to have to specify the format of qrels/run files (TREC or JSON). I often get exceptions if I forget to specify the 'trec' format, because JSON is the default.

Describe the solution you'd like
You could infer the format from the file extension: if .json then JSON, else TREC. I can open a PR if you're interested :)

[BUG] win_tie_loss in Report.to_dict

Describe the bug

When converting a Report to a dict, only one metric is kept while iterating over metrics (each loop iteration overwrites the previous metric):
https://github.com/AmenRa/ranx/blob/master/ranx/report.py#L315

How to fix
Replace the line above with:
d[m1]["win_tie_loss"][m2][metric] = self.win_tie_loss[(m1, m2)][metric]
and initialize d[m1]["win_tie_loss"][m2] = {} at the same place as
https://github.com/AmenRa/ranx/blob/master/ranx/report.py#L309 (just above).

Integer relevance score

Hi,

Thank you for this nice library.

Is there a fundamental reason why relevance scores need to be integers in Qrels?

Thanks.

add an option to disable sort_dict_of_dict_by_value when adding results to a run

Hi -- guy with the weird feature requests here 😅 --

Motivation

You don't want to ask, but I have a use case where all the documents returned by my system have the same score; however, the order matters!
And when you add_and_sort documents to a run, you end up applying sort_dict_of_dict_by_value, which might reverse or completely shuffle the order of the document ids:

In [1]: from ranx import Qrels, Run, evaluate

In [2]: run = Run()
   ...: run.add_multi(
   ...:     q_ids=["q_1", "q_2"],
   ...:     doc_ids=[
   ...:         ["doc_12", "doc_23", "doc_25", "doc_36", "doc_32", "doc_35"],
   ...:         ["doc_12", "doc_11", "doc_25", "doc_36", "doc_2",  "doc_35"],
   ...:     ],
   ...:     scores=[
   ...:         [0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
   ...:         [0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
   ...:     ],
   ...: )
In [3]: list(run.run['q_1'].keys())
Out[3]: ['doc_35', 'doc_32', 'doc_36', 'doc_25', 'doc_23', 'doc_12']

Solution

Obviously, my system could add a slightly negative number to preserve the order of documents; however, this is more of a pain to me than commenting out this line.

The request

Would you be willing to add an option to disable sort_dict_of_dict_by_value when calling add_multi?

Thanks for the quick response on my other issues :)

[Feature Request] custom fusion method in optimize_fusion

Is your feature request related to a problem? Please describe.
Hi, you’ve done a great job implementing plenty of different fusion algorithms, but I think it will always be a bottleneck.
What would you think about letting the user define their own training function?

Describe the solution you'd like
For example, in optimize_fusion, allow method to be a callable and, in this case, do not call has_hyperparams and optimization_switch.

Describe alternatives you've considered

  • Open a feature request every time I want to try out something new :)
  • Fork ranx and implement new fusion methods there

My use case / Ma et al.
By the way, at the moment, my use case is the default-minimum trick of Ma et al.: when combining results from systems A and B, it consists in giving a document the minimum score observed in A's results if that document was only retrieved by system B, and vice versa.

Maybe this is already possible in ranx via some option/method named differently? Or maybe you'd like to add it to the core ranx fusion algorithms?
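
For illustration, a minimal, library-agnostic sketch of the default-minimum trick on two plain run dictionaries (the helper is hypothetical, not part of ranx):

def default_min_combine(run_a: dict, run_b: dict) -> dict:
    """CombSUM of two runs where a missing document gets that system's minimum observed score."""
    combined = {}
    for q_id in run_a.keys() | run_b.keys():
        res_a, res_b = run_a.get(q_id, {}), run_b.get(q_id, {})
        min_a = min(res_a.values(), default=0.0)
        min_b = min(res_b.values(), default=0.0)
        combined[q_id] = {
            doc: res_a.get(doc, min_a) + res_b.get(doc, min_b)
            for doc in res_a.keys() | res_b.keys()
        }
    return combined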

[BUG] dcg and dcg_burges do not work in the compare function

Describe the bug

when adding the newly available dcg or dcg_burges metric to the compare function, I get this error:

report = compare(
    qrels=qrels,
    runs=runs,
    metrics=["recall@10","ndcg","rbp.90","rbp.50","dcg_burges"],
    max_p=0.05,   # P-value threshold
    stat_test='student'
)
Traceback (most recent call last):
  File "/local/home/mkp/data/gap2kic/eval/./run.py", line 32, in <module>
    print(report)
  File "/home/mkp/.asdf/installs/python/3.10.9/lib/python3.10/site-packages/ranx/data_structures/report.py", line 338, in __str__
    return self.to_table()
  File "/home/mkp/.asdf/installs/python/3.10.9/lib/python3.10/site-packages/ranx/data_structures/report.py", line 143, in to_table
    label = self.get_metric_label(x)
  File "/home/mkp/.asdf/installs/python/3.10.9/lib/python3.10/site-packages/ranx/data_structures/report.py", line 122, in get_metric_label
    return f"{metric_labels[m]}"
KeyError: 'dcg_burges'

however, in the same file, when I do

res = evaluate(qrels, run, ["recall@10","ndcg","rbp.90","rbp.50","dcg","dcg_burges"])

everything works as intended.

PSP@k: Propensity-scored precision at k

I want to implement propensity-scored precision at k (PSP@k), as defined below:

$PSP@k = \frac{1}{k} \sum_{i=1}^{k} \frac{y_i}{p_i}$

where $p_i$ is the propensity of $y_i$ and $1 \leq i \leq k$.

How could I integrate this metric into ranx?
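
As a starting point, a minimal, library-agnostic sketch of PSP@k on plain qrels/run dictionaries plus a per-document propensity lookup (the helper is hypothetical, not ranx API):

def psp_at_k(qrels: dict, run: dict, propensities: dict, k: int) -> float:
    """Mean propensity-scored precision at k over all queries."""
    scores = []
    for q_id, results in run.items():
        top_k = sorted(results, key=results.get, reverse=True)[:k]
        relevant = qrels.get(q_id, {})
        # y_i is the relevance of the i-th retrieved document, p_i its propensity
        scores.append(sum(relevant.get(doc, 0) / propensities[doc] for doc in top_k) / k)
    return sum(scores) / len(scores)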

References:

[1] Zhang, J., Chang, W.C., Yu, H.F. and Dhillon, I., 2021. Fast multi-resolution transformer fine-tuning for extreme multi-label text classification. Advances in Neural Information Processing Systems, 34, pp.7267-7280.

[BUG] Issues when storing/loading Qrels from a dataframe and a parquet file.

Describe the bug
Bug when reconstructing Qrels from a pandas dataframe. This bug also affects reading Qrels from a parquet file, as the pandas-to-Qrels conversion is used internally.

Pandas version: 1.5.2
ranx: latest pip version

To Reproduce
Code:

from ranx import Qrels
qrels = Qrels({'1':{'1':1}, '2':{'2': 1}})
df = qrels.to_dataframe()
Qrels.from_df(df)

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\lib\site-packages\ranx\data_structures\qrels.py", line 300, in from_df
    assert df[score_col].dtype == int, "DataFrame scores column dtype must be `int`"
AssertionError: DataFrame scores column dtype must be `int`

About the dataframe dtypes:
the score column is int64 rather than Python int.

>>> df.dtypes
q_id      object
doc_id    object
score      int64
dtype: object

>>> import numpy as np
>>> df.dtypes['score'] == np.int64
True

set Report’s rounding_digits in compare

Hi,

compare does not have a rounding_digits argument and thus always uses the default from Report (which is 4). Why is that?

Also, would you like to add an option in Report to print results as percentages rather than ratios?

Incorrect result for f1 score

Using f1 or _f1_parallel on all qrels and runs gives incorrect output, but if I use _f1 on an individual query, it gives the correct F1 score.

Using the two functions below returns 0 for 4 cases. Ideally, it should only be 0 for 1 of the 18 cases in the _qrels and _run passed.

from ranx.metrics.f1 import _f1_parallel, _f1, f1

f1(_qrels, _run, 1, 1)
# output 
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])

_f1_parallel(_qrels, _run, 1 , 1)
# output 
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])

But if I call _f1 for each item in qrels, I get correct F1 score for all queries.

from numba import prange

scores = []

for i in prange(len(_qrels)):
  try:
      scores.append(_f1(_qrels[i], _run[i], 1 , 1))
  except Exception as error:
    # handle the exception
    print(f" {i} An exception occurred:", error)
    scores.append(0)
    continue

# output
# [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0, 1.0, 1.0, 1.0]

Notice, I get only 1 case where the F1 score is 0, which I think is expected for my case.

More context -

I am using f1 metric at k =1 for a dataset where each query has only 1 relevant document. There are 18 unique qrels.
Since each query only has 1 relevant document, the score for mrr@1 = recall@1 = 0.9444

also, precision@1 = 0.944 for my case.

Here's the output of my evaluation -
{'mrr@1': 0.9444444444444444, 'mrr@2': 0.9444444444444444, 'recall@1': 0.9444444444444444, 'recall@2': 0.9444444444444444, 'precision@1': 0.9444444444444444, 'f1@1': 0.7777777777777778}

f1@1 seems very low considering that precision and recall @1 are equal and high.

Here's the pretty-printed output of the scores for each individual query.
I've manually validated the output for all the metrics, and all of them seem correct to me except for f1@1.
Notice that the hits score is 0 only for q_id '6', so an f1 score of 0 for q_id '6' is expected, but the f1 score is also 0 for q_ids '7', '8' and '9'.

{
"mrr": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.3333333333333333,
"7": 1.0,
"8": 1.0,
"9": 1.0
},
"mrr@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 1.0,
"8": 1.0,
"9": 1.0
},
"recall@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 1.0,
"8": 1.0,
"9": 1.0
},
"precision@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 1.0,
"8": 1.0,
"9": 1.0
},
"f1@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 0.0,
"8": 0.0,
"9": 0.0
},
"hits@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 1.0,
"8": 1.0,
"9": 1.0
}
}

Thanks in advance.
Let me know if I am missing something.

[Feature Request] from ranx import Report

Is your feature request related to a problem? Please describe.
Hi! I’d like to be able to import Report so that I can load a previously saved report (output of compare) and tweak the runs.

Describe the solution you'd like
from .report import Report

Describe alternatives you've considered
Re-run compare with different runs 😅

[BUG] ttest_rel() got an unexpected keyword argument 'alternative' when using compare with stat_test="student"

Describe the bug
Hi, I'm having an error when using compare with stat_test="student" (there is no problem when using the default "fisher").

TypeError                                 Traceback (most recent call last)
<ipython-input-3-0369b81922de> in <module>
      7     metrics=["map@100", "mrr@100", "ndcg@10"],
      8     stat_test="student",
----> 9     max_p=0.01  # P-value threshold
     10 )

/gpfsdswork/projects/rech/fih/usl47jg/ranx/ranx/meta/compare.py in compare(qrels, runs, metrics, stat_test, n_permutations, max_p, random_seed, threads, rounding_digits, show_percentages)
    100         n_permutations=n_permutations,
    101         max_p=max_p,
--> 102         random_seed=random_seed,
    103     )
    104 

/gpfsdswork/projects/rech/fih/usl47jg/ranx/ranx/statistical_tests/__init__.py in compute_statistical_significance(model_names, metric_scores, stat_test, n_permutations, max_p, random_seed)
     81                         n_permutations,
     82                         max_p,
---> 83                         random_seed,
     84                     )
     85 

/gpfsdswork/projects/rech/fih/usl47jg/ranx/ranx/statistical_tests/__init__.py in _compute_statistical_significance(control_metric_scores, treatment_metric_scores, stat_test, n_permutations, max_p, random_seed)
     38         elif stat_test == "student":
     39             p_value, significant = paired_student_t_test(
---> 40                 control_metric_scores[m], treatment_metric_scores[m], max_p,
     41             )
     42 

/gpfsdswork/projects/rech/fih/usl47jg/ranx/ranx/statistical_tests/paired_student_t_test.py in paired_student_t_test(control, treatment, max_p)
     11 
     12     """
---> 13     _, p_value = ttest_rel(control, treatment, alternative="two-sided")
     14 
     15     return p_value, p_value <= max_p

TypeError: ttest_rel() got an unexpected keyword argument 'alternative'

To Reproduce

In [1]: from ranx import Qrels, Run
   ...:
   ...: qrels_dict = { "q_1": { "d_12": 5, "d_25": 3 },
   ...:                "q_2": { "d_11": 6, "d_22": 1 } }
   ...:
   ...: run_dict = { "q_1": { "d_12": 0.9, "d_23": 0.8, "d_25": 0.7,
   ...:                       "d_36": 0.6, "d_32": 0.5, "d_35": 0.4  },
   ...:              "q_2": { "d_12": 0.9, "d_11": 0.8, "d_25": 0.7,
   ...:                       "d_36": 0.6, "d_22": 0.5, "d_35": 0.4  } }
   ...:
   ...: qrels = Qrels(qrels_dict)
   ...: run = Run(run_dict)

In [2]: from ranx import compare
   ...:
   ...: # Compare different runs and perform statistical tests
   ...: report = compare(
   ...:     qrels=qrels,
   ...:     runs=[run, run],
   ...:     metrics=["map@100", "mrr@100", "ndcg@10"],
   ...:     stat_test="student",
   ...:     max_p=0.01  # P-value threshold
   ...: )

Versions
ranx==0.2.8

Issue with MRR

Attempting to compute MRR with your example and getting a Typing error.

System:

  • python 3.8.2
from rank_eval import ndcg, mrr
import numpy as np

# Note that y_true does not need to be ordered
# Integers are documents IDs, while floats are the true relevance scores
y_true = np.array([[[12, 0.5], [25, 0.3]], [[11, 0.4], [2, 0.6]]])
y_pred = np.array(
    [
        [[12, 0.9], [234, 0.8], [25, 0.7], [36, 0.6], [32, 0.5], [35, 0.4]],
        [[12, 0.9], [11, 0.8], [25, 0.7], [36, 0.6], [2, 0.5], [35, 0.4]],
    ]
)
k = 5

mrr(y_true, y_pred, k, threads=1)

Error message


---------------------------------------------------------------------------
TypingError                               Traceback (most recent call last)
<ipython-input-24-7d69c81d4a8f> in <module>
     13 k = 5
     14 
---> 15 mrr(y_true, y_pred, k, threads=1)

~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py in mrr(y_true, y_pred, k, return_mean, sort, threads)
    482     """
    483 
--> 484     return _choose_optimal_function(
    485         y_true=y_true,
    486         y_pred=y_pred,

~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py in _choose_optimal_function(y_true, y_pred, f_name, f_single, f_parallel, f_additional_args, return_mean, sort, threads)
    234         if sort:
    235             y_pred = _descending_sort_parallel(y_pred)
--> 236         scores = f_parallel(y_true, y_pred, **f_additional_args)
    237         if return_mean:
    238             return np.mean(scores)

~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
    399                 e.patch_message(msg)
    400 
--> 401             error_rewrite(e, 'typing')
    402         except errors.UnsupportedError as e:
    403             # Something unsupported is present in the user code, add help info

~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
    342                 raise e
    343             else:
--> 344                 reraise(type(e), e, None)
    345 
    346         argtypes = []

~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/numba/core/utils.py in reraise(tp, value, tb)
     77         value = tp()
     78     if value.__traceback__ is not tb:
---> 79         raise value.with_traceback(tb)
     80     raise value
     81 

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<built-in function contains>) with argument(s) of type(s): (array(float64, 1d, A), float64)
 * parameterized
In definition 0:
    All templates rejected with literals.
In definition 1:
    All templates rejected without literals.
In definition 2:
    All templates rejected with literals.
In definition 3:
    All templates rejected without literals.
In definition 4:
    All templates rejected with literals.
In definition 5:
    All templates rejected without literals.
In definition 6:
    All templates rejected with literals.
In definition 7:
    All templates rejected without literals.
In definition 8:
    All templates rejected with literals.
In definition 9:
    All templates rejected without literals.
In definition 10:
    All templates rejected with literals.
In definition 11:
    All templates rejected without literals.
In definition 12:
    All templates rejected with literals.
In definition 13:
    All templates rejected without literals.
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: typing of intrinsic-call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (78)

File "../../../anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py", line 78:
def _reciprocal_rank(y_true, y_pred, k):
    <source elided>
    for i in range(k):
        if y_pred[i, 0] in y_true[:, 0]:
        ^

[1] During: resolving callee type: type(CPUDispatcher(<function _reciprocal_rank at 0x7f6697ed9dc0>))
[2] During: typing of call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (11)

[3] During: resolving callee type: type(CPUDispatcher(<function _reciprocal_rank at 0x7f6697ed9dc0>))
[4] During: typing of call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (11)


File "../../../anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py", line 11:
def _parallelize(f, y_true, y_pred, k):
    <source elided>
    for i in prange(len(y_true)):
        scores[i] = f(y_true[i], y_pred[i], k)
        ^

[1] During: resolving callee type: type(CPUDispatcher(<function _parallelize at 0x7f6697ed4e50>))
[2] During: typing of call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (85)

[3] During: resolving callee type: type(CPUDispatcher(<function _parallelize at 0x7f6697ed4e50>))
[4] During: typing of call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (85)


File "../../../anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py", line 85:
def _mrr(y_true, y_pred, k):
    return _parallelize(_reciprocal_rank, y_true, y_pred, k)
    ^

[BUG] RBP with multiple relevance levels

Describe the bug

Given the formula of RBP of

    \operatorname{RBP} = (1 - p) \cdot \sum_{i=1}^{d}{r_i \cdot p^{i - 1}}

where r_i is the reward/utility of the document at rank i, RBP should support multiple relevance levels, similarly to DCG, so that if the maximum relevance level is 2, the maximum RBP value is 2.
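
For reference, a minimal sketch of RBP computed over graded utilities for a single ranked list (plain Python, not ranx API):

def rbp(utilities, p=0.9):
    """Rank-biased precision for one ranked list of graded utilities r_i."""
    return (1 - p) * sum(r * p ** i for i, r in enumerate(utilities))

# With binary utilities RBP is bounded by 1; with graded utilities up to 2,
# the same formula yields values up to 2, which is the behaviour requested above.
print(rbp([2, 1, 0, 2]))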

question: Comparing models with multiple runs

First of all, great work on this code. I have been looking for a definitive package to evaluate ranking models and I believe this is that package.

My question is perhaps a bit outside the domain, but it could help others in the future. How would you deal with comparing 2 models where each has multiple runs (e.g., runs with different random initializations and/or batch shuffling, for confidence intervals)? I was thinking that perhaps the significance testing could be performed between the mean (across runs) metric_scores vectors.

Thanks in advance,

Tiago

[BUG] Could not find a version that satisfies the requirement ranx

Describe the bug
Could not find a version that satisfies the requirement ranx (pip3 install ranx)

Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.6 LTS
Release:	18.04
Codename:	bionic

pip 21.3.1 from /usr/local/lib/python3.6/dist-packages/pip (python 3.6)

Error message:

ERROR: Could not find a version that satisfies the requirement ranx (from versions: none)
ERROR: No matching distribution found for ranx

[Feature Request] Expose DCG as metric

In industry, DCG (in both formulations) is a standard and widely used metric. I see it is already implemented as part of NDCG. Would it be possible to expose it to users as a real metric?

Zero-scored documents

@mponza found a bug when computing recall and promised to document it next week; adding a reminder here for him. Something related to multiple documents having a score of zero.

Ranx pip installing failed

Describe the bug
Error during pip install.

To Reproduce
pip install ranx==0.3.2

Bash output

(xCoFormer) celso@capri:~/projects/xCoFormer$ pip install ranx==0.3.2
Collecting ranx==0.3.2
  Using cached ranx-0.3.2-py3-none-any.whl (93 kB)
Requirement already satisfied: numpy in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (1.22.4)
Requirement already satisfied: tqdm in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (4.64.1)
Requirement already satisfied: scipy>=1.6.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (1.9.3)
Collecting lz4
  Using cached lz4-4.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting cbor2
  Using cached cbor2-5.4.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (224 kB)
Collecting ir-datasets
  Using cached ir_datasets-0.5.4-py3-none-any.whl (311 kB)
Collecting statsmodels
  Using cached statsmodels-0.13.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.0 MB)
Requirement already satisfied: pandas in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (1.4.4)
Requirement already satisfied: rich in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (12.6.0)
Collecting orjson
  Using cached orjson-3.8.0-cp310-cp310-manylinux_2_28_x86_64.whl (146 kB)
Requirement already satisfied: tabulate in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (0.9.0)
Collecting numba>=0.54.1
  Using cached numba-0.56.3-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.5 MB)
Collecting llvmlite<0.40,>=0.39.0dev0
  Using cached llvmlite-0.39.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.6 MB)
Requirement already satisfied: setuptools in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from numba>=0.54.1->ranx==0.3.2) (59.6.0)
Requirement already satisfied: zlib-state>=0.1.3 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (0.1.5)
Collecting lxml>=4.5.2
  Using cached lxml-4.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (6.9 MB)
Requirement already satisfied: ijson>=3.1.3 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (3.1.4)
Requirement already satisfied: trec-car-tools>=2.5.4 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (2.6)
Requirement already satisfied: requests>=2.22.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (2.28.1)
Requirement already satisfied: pyyaml>=5.3.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (6.0)
Collecting pyautocorpus>=0.1.1
  Using cached pyautocorpus-0.1.8.tar.gz (10 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: unlzw3>=0.2.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (0.2.1)
Collecting beautifulsoup4>=4.4.1
  Using cached beautifulsoup4-4.11.1-py3-none-any.whl (128 kB)
Requirement already satisfied: warc3-wet>=0.2.3 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (0.2.3)
Requirement already satisfied: warc3-wet-clueweb09>=0.2.5 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (0.2.5)
Requirement already satisfied: python-dateutil>=2.8.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from pandas->ranx==0.3.2) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from pandas->ranx==0.3.2) (2022.5)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from rich->ranx==0.3.2) (2.13.0)
Requirement already satisfied: commonmark<0.10.0,>=0.9.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from rich->ranx==0.3.2) (0.9.1)
Collecting patsy>=0.5.2
  Using cached patsy-0.5.3-py2.py3-none-any.whl (233 kB)
Requirement already satisfied: packaging>=21.3 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from statsmodels->ranx==0.3.2) (21.3)
Requirement already satisfied: soupsieve>1.2 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from beautifulsoup4>=4.4.1->ir-datasets->ranx==0.3.2) (2.3.2.post1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from packaging>=21.3->statsmodels->ranx==0.3.2) (3.0.9)
Requirement already satisfied: six in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from patsy>=0.5.2->statsmodels->ranx==0.3.2) (1.16.0)
Requirement already satisfied: idna<4,>=2.5 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from requests>=2.22.0->ir-datasets->ranx==0.3.2) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from requests>=2.22.0->ir-datasets->ranx==0.3.2) (1.26.12)
Requirement already satisfied: certifi>=2017.4.17 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from requests>=2.22.0->ir-datasets->ranx==0.3.2) (2022.9.24)
Requirement already satisfied: charset-normalizer<3,>=2 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from requests>=2.22.0->ir-datasets->ranx==0.3.2) (2.1.1)
Requirement already satisfied: cbor>=1.0.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from trec-car-tools>=2.5.4->ir-datasets->ranx==0.3.2) (1.0.0)
Building wheels for collected packages: pyautocorpus
  Building wheel for pyautocorpus (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [13 lines of output]
      running bdist_wheel
      running build
      running build_ext
      building 'pyautocorpus' extension
      creating build
      creating build/temp.linux-x86_64-3.10
      creating build/temp.linux-x86_64-3.10/src
      x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPCRE_STATIC -I/tmp/pip-install-g3fc194j/pyautocorpus_51c04bb4f6a3404393baddede408d7df/AutoCorpus/src/common -I/tmp/pip-install-g3fc194j/pyautocorpus_51c04bb4f6a3404393baddede408d7df/AutoCorpus/src/wikipedia -I/home/celso/projects/venvs/xCoFormer/include -I/usr/include/python3.10 -c src/Textifier.cpp -o build/temp.linux-x86_64-3.10/src/Textifier.o -std=c++11
      src/Textifier.cpp:40:10: fatal error: pcre.h: No such file or directory
         40 | #include <pcre.h>
            |          ^~~~~~~~
      compilation terminated.
      error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pyautocorpus
  Running setup.py clean for pyautocorpus
Failed to build pyautocorpus
Installing collected packages: pyautocorpus, patsy, orjson, lz4, lxml, llvmlite, cbor2, beautifulsoup4, numba, ir-datasets, statsmodels, ranx
  Running setup.py install for pyautocorpus ... error
  error: subprocess-exited-with-error
  
  × Running setup.py install for pyautocorpus did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      running install
      /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
        warnings.warn(
      running build
      running build_ext
      building 'pyautocorpus' extension
      creating build
      creating build/temp.linux-x86_64-3.10
      creating build/temp.linux-x86_64-3.10/src
      x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPCRE_STATIC -I/tmp/pip-install-g3fc194j/pyautocorpus_51c04bb4f6a3404393baddede408d7df/AutoCorpus/src/common -I/tmp/pip-install-g3fc194j/pyautocorpus_51c04bb4f6a3404393baddede408d7df/AutoCorpus/src/wikipedia -I/home/celso/projects/venvs/xCoFormer/include -I/usr/include/python3.10 -c src/Textifier.cpp -o build/temp.linux-x86_64-3.10/src/Textifier.o -std=c++11
      src/Textifier.cpp:40:10: fatal error: pcre.h: No such file or directory
         40 | #include <pcre.h>
            |          ^~~~~~~~
      compilation terminated.
      error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> pyautocorpus

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

Env:

  • OS: Ubuntu 20.04
  • Python 3.10.6

question: why student rather than fisher stat test?

Hi,

Just a quick question: I wondered what motivated the choice to change the default value of the stat test from fisher to student in 0dc8d9c (I almost published results as-is before figuring it out 😅).

I thought one of your documentation pages pointed me to this paper by Smucker et al. that suggests using Fisher (and especially not Student), but maybe I don't recall correctly.

Btw, the docstring still shows 'fisher' as the default: https://amenra.github.io/ranx/compare/
