
Comments (10)

hitercs commented on June 4, 2024

Oh! Sorry, I hadn't synced your commits on my server. It works normally now. Good job! Thanks.

from genre.

nicola-decao commented on June 4, 2024

@hitercs No, I cannot reproduce this. Here is what I have running:

python scripts_genre/evaluate_kilt_dataset.py \
   models/fairseq_entity_disambiguation_aidayago \
   datasets/msnbc-test-kilt.jsonl \
   datasets/msnbc-test-kilt-out.jsonl \
   --candidates \
   --batch_size 16 \
   -d -v
INFO:root:Loading model
INFO:fairseq.file_utils:loading archive file models/fairseq_entity_disambiguation_aidayago
INFO:fairseq.tasks.translation:[source] dictionary: 50264 types
INFO:fairseq.tasks.translation:[target] dictionary: 50264 types
INFO:root:Loading datasets/msnbc-test-kilt.jsonl
Evaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 656/656 [01:03<00:00, 10.35it/s, f1=0.941, prec=0.945, rec=0.938]
INFO:root:Saving dataset in datasets/msnbc-test-kilt-out.jsonl
+-----------------+-------+-----------+--------+-------------+----------+
|     Dataset     |   F1  | Precision | Recall | R-precision | Recall@5 |
+-----------------+-------+-----------+--------+-------------+----------+
| msnbc-test-kilt | 94.26 |   94.62   | 93.90  |    93.90    |  97.41   |
+-----------------+-------+-----------+--------+-------------+----------+


nicola-decao commented on June 4, 2024

@hitercs Yes, I'll update it in an hour since I'm fixing other stuff as well.


nicola-decao commented on June 4, 2024

I managed to reproduce your bug. Working on it.


nicola-decao commented on June 4, 2024

@hitercs Now batches work again! Regarding the mismatch in scores: I am trying other datasets (e.g. on MSNBC I got the same score as in the paper, but not on ACE2004, as you reported). One very likely possibility is the following:

  1. I used a version of fairseq from 6 months ago. There were some breaking changes, and it might be that the code of the BART model or the beam search changed slightly, leading to different results.
  2. For the results in the paper I used an internal (private) Facebook AI version of fairseq, so the code might also be slightly different there.

I hope this helps, and thanks for reporting the bug! I really appreciate it 😊


hitercs commented on June 4, 2024

@nicola-decao Thanks.

However, when I run with --candidates --batch_size 16, the performance is still very low. It is normal without --candidates.


nicola-decao commented on June 4, 2024

@hitercs Are you using the fairseq version I indicated in the example?


hitercs commented on June 4, 2024

Sorry, I meant that when specifying the argument --candidates, the bug seems to still be there. Can you get normal performance using --candidates --batch_size 16? Yes, I use the fairseq version in the repo you provided.


hitercs commented on June 4, 2024

By the way, one more minor bug report:

trie = pickle.load(f)

Should it be the following instead?

trie = Trie.load_from_dict(pickle.load(f))
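The point of the fix above is that the pickle file stores only the trie's raw dictionary, so `pickle.load(f)` alone returns a plain dict without any trie methods. Below is a minimal, self-contained sketch of that distinction — the `Trie` class here is a hypothetical stand-in, not GENRE's actual implementation:

```python
import io
import pickle

# Hypothetical minimal Trie -- a sketch to illustrate the bug report
# above, not the genre library's real Trie class.
class Trie:
    def __init__(self, sequences=()):
        self.trie_dict = {}
        for seq in sequences:
            node = self.trie_dict
            for token in seq:
                node = node.setdefault(token, {})

    @staticmethod
    def load_from_dict(trie_dict):
        # Rebuild a Trie object around an already-constructed dict.
        trie = Trie()
        trie.trie_dict = trie_dict
        return trie

    def get(self, prefix):
        # Return the tokens that may follow `prefix`.
        node = self.trie_dict
        for token in prefix:
            node = node.get(token, {})
        return list(node)

# Only the raw dict is pickled, so loading yields a dict, not a Trie...
buf = io.BytesIO(pickle.dumps(Trie([[2, 5, 9], [2, 7]]).trie_dict))
raw = pickle.load(buf)           # plain dict: no trie methods
trie = Trie.load_from_dict(raw)  # wrapping restores the Trie interface
print(trie.get([2]))             # -> [5, 7]
```

Calling `raw.get([2])` directly would fail (a dict's `get` expects a hashable key, not a prefix), which is exactly the kind of breakage the wrapper avoids.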


Saibo-creator commented on June 4, 2024

> @hitercs No, I cannot reproduce this. Here is what I have running:

python scripts_genre/evaluate_kilt_dataset.py \
   models/fairseq_entity_disambiguation_aidayago \
   datasets/msnbc-test-kilt.jsonl \
   datasets/msnbc-test-kilt-out.jsonl \
   --candidates \
   --batch_size 16 \
   -d -v
INFO:root:Loading model
INFO:fairseq.file_utils:loading archive file models/fairseq_entity_disambiguation_aidayago
INFO:fairseq.tasks.translation:[source] dictionary: 50264 types
INFO:fairseq.tasks.translation:[target] dictionary: 50264 types
INFO:root:Loading datasets/msnbc-test-kilt.jsonl
Evaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 656/656 [01:03<00:00, 10.35it/s, f1=0.941, prec=0.945, rec=0.938]
INFO:root:Saving dataset in datasets/msnbc-test-kilt-out.jsonl
+-----------------+-------+-----------+--------+-------------+----------+
|     Dataset     |   F1  | Precision | Recall | R-precision | Recall@5 |
+-----------------+-------+-----------+--------+-------------+----------+
| msnbc-test-kilt | 94.26 |   94.62   | 93.90  |    93.90    |  97.41   |
+-----------------+-------+-----------+--------+-------------+----------+

Why would Precision and Recall differ here? If I remember correctly, the ED task has only a single gold target and you always take the top-1 prediction for evaluation. In that case, wouldn't we always have recall = precision = F1 = accuracy?
Or is there a detail I'm getting wrong?
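One way the two can diverge even with a single gold target and top-1 predictions: if some queries produce no prediction at all (e.g. an empty candidate set), micro precision (correct over predicted) and micro recall (correct over gold) use different denominators. This is a hedged sketch of the standard micro-averaged definitions, not necessarily the exact logic of evaluate_kilt_dataset.py:

```python
# Micro precision/recall over top-1 entity disambiguation outputs.
# `None` marks a query for which no prediction was produced -- these
# entries are hypothetical illustration data, not real results.
gold  = ["A", "B", "C", "D"]   # one gold entity per query
preds = ["A", "B", None, "X"]  # third query yields no prediction

correct   = sum(p == g for p, g in zip(preds, gold) if p is not None)
predicted = sum(p is not None for p in preds)

precision = correct / predicted   # 2 / 3
recall    = correct / len(gold)   # 2 / 4
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)      # precision > recall here
```

If every query produced exactly one prediction, `predicted == len(gold)` and precision, recall, and F1 would indeed all collapse to accuracy.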

