
genre's Introduction

The GENRE (Generative ENtity REtrieval) system, as presented in Autoregressive Entity Retrieval, implemented in PyTorch.

@inproceedings{decao2021autoregressive,
  author    = {Nicola {De Cao} and
               Gautier Izacard and
               Sebastian Riedel and
               Fabio Petroni},
  title     = {Autoregressive Entity Retrieval},
  booktitle = {9th International Conference on Learning Representations, {ICLR} 2021,
               Virtual Event, Austria, May 3-7, 2021},
  publisher = {OpenReview.net},
  year      = {2021},
  url       = {https://openreview.net/forum?id=5k8F6UU39V},
}

The mGENRE system, as presented in Multilingual Autoregressive Entity Linking.

@article{de-cao-etal-2022-multilingual,
    title = "Multilingual Autoregressive Entity Linking",
    author = "De Cao, Nicola  and
      Wu, Ledell  and
      Popat, Kashyap  and
      Artetxe, Mikel  and
      Goyal, Naman  and
      Plekhanov, Mikhail  and
      Zettlemoyer, Luke  and
      Cancedda, Nicola  and
      Riedel, Sebastian  and
      Petroni, Fabio",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "10",
    year = "2022",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/2022.tacl-1.16",
    doi = "10.1162/tacl_a_00460",
    pages = "274--290",
}

Please consider citing our works if you use code from this repository.

In a nutshell, (m)GENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on a fine-tuned BART (or, for the multilingual version, mBART) architecture. (m)GENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to generate only valid identifiers. An example is Wikipedia page retrieval for open-domain question answering: given a query, the model generates the title of the Wikipedia page that answers it.

For end-to-end entity linking, GENRE re-generates the input text annotated with a markup:
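For instance (an illustration consistent with the end-to-end examples later on this page, not a verbatim model output), the input "In 1921, Einstein received a Nobel Prize." would be re-generated with mentions in braces and entity identifiers in brackets:

In 1921, { Einstein } [ Albert Einstein ] received a { Nobel Prize } [ Nobel Prize in Physics ].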

GENRE achieves state-of-the-art results on multiple datasets.

mGENRE performs multilingual entity linking in 100+ languages, treating languages as latent variables and marginalizing over them.

Main dependencies

  • python>=3.7
  • pytorch>=1.6
  • fairseq>=0.10 (optional, for training GENRE) NOTE: fairseq is undergoing changes without backward compatibility. Install fairseq from source and use this commit for reproducibility. See here for the current PR that should fix fairseq/master.
  • transformers>=4.2 (optional for inference of GENRE)

Examples & Usage

For a full review of the (m)GENRE API, see the examples in the repository.

GENRE

After importing and loading the model and a prefix tree (trie), you would generate predictions (in this example for Entity Disambiguation) with a simple call like:

import pickle

from genre.fairseq_model import GENRE
from genre.trie import Trie

# load the prefix tree (trie)
with open("../data/kilt_titles_trie_dict.pkl", "rb") as f:
    trie = Trie.load_from_dict(pickle.load(f))

# load the model
model = GENRE.from_pretrained("models/fairseq_entity_disambiguation_aidayago").eval()

# generate Wikipedia titles
model.sample(
    sentences=["Einstein was a [START_ENT] German [END_ENT] physicist."],
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
[[{'text': 'Germany', 'score': tensor(-0.1856)},
  {'text': 'Germans', 'score': tensor(-0.5461)},
  {'text': 'German Empire', 'score': tensor(-2.1858)}]]
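If you need a prefix tree over your own set of entity names, here is a minimal sketch reusing the model loaded above. It mirrors the construction shown in the "Re-create the trie tree" issue further down this page, where each stored sequence starts with token id 2 (BART's EOS):

# build a custom prefix tree over a small list of entity names;
# each sequence is [2] + the BPE token ids of " <entity name>"
custom_entities = ["Germany", "Germans", "German Empire"]
custom_trie = Trie(
    [2] + model.encode(e)[1:].tolist()
    for e in custom_entities
)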

mGENRE

Making predictions with mGENRE is very similar, but we additionally need to map (title, language_ID) to Wikidata IDs and (optionally) marginalize over predictions of the same entity:

import pickle

from genre.fairseq_model import mGENRE
from genre.trie import MarisaTrie, Trie

with open("../data/lang_title2wikidataID-normalized_with_redirect.pkl", "rb") as f:
    lang_title2wikidataID = pickle.load(f)

# memory efficient prefix tree (trie) implemented with `marisa_trie`
with open("../data/titles_lang_all105_marisa_trie_with_redirect.pkl", "rb") as f:
    trie = pickle.load(f)

# generate Wikipedia titles and language IDs
model = mGENRE.from_pretrained("../models/fairseq_multilingual_entity_disambiguation").eval()

model.sample(
    sentences=["[START] Einstein [END] era un fisico tedesco."],
    # Italian for "[START] Einstein [END] was a German physicist."
    prefix_allowed_tokens_fn=lambda batch_id, sent: [
        e for e in trie.get(sent.tolist()) if e < len(model.task.target_dictionary)
    ],
    text_to_id=lambda x: max(lang_title2wikidataID[
        tuple(reversed(x.split(" >> ")))
    ], key=lambda y: int(y[1:])),
    marginalize=True,
)
[[{'id': 'Q937',
   'texts': ['Albert Einstein >> it',
    'Alberto Einstein >> it',
    'Einstein >> it'],
   'scores': tensor([-0.0808, -1.4619, -1.5765]),
   'score': tensor(-0.0884)},
  {'id': 'Q60197',
   'texts': ['Alfred Einstein >> it'],
   'scores': tensor([-1.4337]),
   'score': tensor(-3.2058)},
  {'id': 'Q15990626',
   'texts': ['Albert Einstein (disambiguation) >> en'],
   'scores': tensor([-1.0998]),
   'score': tensor(-3.6478)}]]
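The per-entity 'score' above is (roughly) the marginal over all (title, language) predictions that map to the same Wikidata ID. A minimal sketch of the idea, ignoring the length penalty that the actual marginalize_lenpen argument applies:

import torch

# log-probabilities of the three predictions that all map to Q937
scores = torch.tensor([-0.0808, -1.4619, -1.5765])

# marginal log-probability: log of the sum of the individual probabilities
marginal = torch.logsumexp(scores, dim=0)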

Models & Datasets

For GENRE, use this script to download all models and this one to download all datasets. See here for the list of all individual models for each task, for both PyTorch (fairseq) and Hugging Face (transformers). See the example of how to download additional optional files, like the prefix tree (trie) for KILT Wikipedia.

For mGENRE we only have one model, available here. See the example of how to download additional optional files, like the prefix tree (trie) for Wikipedia in all languages and the mapping between titles and Wikidata IDs.

A pre-trained mBART model covering 125 languages is available here.

Troubleshooting

If the module cannot be found, preface the python command with PYTHONPATH=. (e.g., PYTHONPATH=. python scripts_genre/convert_kilt_to_fairseq.py datasets/blink-dev-kilt.jsonl preprocessed_datasets/).

License

GENRE is licensed under the CC-BY-NC 4.0 license. The text of the license can be found here.

genre's People

Contributors

fabiopetroni, nicola-decao, ynouri


genre's Issues

Are default arguments for create_input function missing?

https://github.com/facebookresearch/GENRE/tree/main/examples_genre#pre-processing

Following the steps above, I ran the preprocessing as shown in the following command and encountered an error.

$ python scripts_genre/convert_kilt_to_fairseq.py datasets/blink-dev-kilt.jsonl preprocessed_datasets/
INFO:root:Loading datasets/blink-dev-kilt.jsonl
Processing:   0%|                                                                                | 0/10000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "scripts_genre/convert_kilt_to_fairseq.py", line 80, in <module>
    source, target = convert_kilt_to_fairseq(
  File "scripts_genre/convert_kilt_to_fairseq.py", line 29, in convert_kilt_to_fairseq
    source.append(create_input(doc, max_length=384))
TypeError: create_input() missing 2 required positional arguments: 'start_delimiter' and 'end_delimiter'

https://github.com/facebookresearch/GENRE/blob/main/genre/utils.py#L45

It seems to me that start_delimiter and end_delimiter in the create_input function should have default arguments.
I would appreciate it if you could advise me if my execution procedure is wrong.
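A plausible workaround (an assumption on my part, not a confirmed fix) is to pass the delimiters explicitly; the values below are the entity markers used elsewhere on this page:

# hypothetical call with explicit delimiters matching GENRE's entity markers
source.append(create_input(
    doc,
    max_length=384,
    start_delimiter="[START_ENT]",
    end_delimiter="[END_ENT]",
))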

Pre-trained mBart-125

Hi there,

I was wondering if there are any plans to release the mBART model pre-trained on cc100 in 125 languages. I believe it could be a useful resource for other multilingual tasks, as the largest public checkpoint coverage at the moment is, I believe, 50 languages with mBART-50.

Sorry if this has been answered before or maybe the model is released somewhere and I missed it, but I couldn't find any info about this so decided to ask here.

Thanks!

Disambiguate "George W. Bush" and "George H. W. Bush"

To correctly disambiguate the two US presidents, "George W. Bush" and "George H. W. Bush",
I have tried several approaches using the hf_e2e_entity_linking_aidayago and hf_e2e_entity_linking_wiki_abs models:

# Example: End-to-End Entity Linking
# wikipedia aidayago
model = GENRE.from_pretrained(os.path.join(cache_dir,"hf_e2e_entity_linking_aidayago")).eval()
# or wikipedia
wiki_model = GENRE.from_pretrained(os.path.join(cache_dir,"hf_e2e_entity_linking_wiki_abs")).eval()

w/ mention_trie, mention_to_candidates_dict

sentences = ["George Bush was the 43rd president of United States"]
entity_spans = get_entity_spans(
    model,
    sentences,
    mention_trie=Trie([
        model.encode(" {}".format(e))[1:].tolist()
        for e in ["George Bush"]
    ]),
     mention_to_candidates_dict={
        "George Bush": ["George W. Bush", "George H. W. Bush"]
    }
)
print(get_markdown(sentences, entity_spans)[0])

Result: WRONG 👎🏾

[George Bush](https://en.wikipedia.org/wiki/George_H._W._Bush) was the 43rd president of United States
sentences = ["George Bush was the 43rd president of United States"]
entity_spans = get_entity_spans(
    model,
    sentences,
    mention_trie=Trie([
        model.encode(" {}".format(e))[1:].tolist()
        for e in ["George Bush"]
    ]),
    mention_to_candidates_dict={
        "George Bush": ["George W. Bush"]
    }
)
print(get_markdown(sentences, entity_spans)[0])

Result: CORRECT 👍🏾 🥇

[George Bush](https://en.wikipedia.org/wiki/George_W._Bush) was the 43rd president of United States

w/ candidates_trie

sentences = ["George Bush was the 43rd president of the United States from 2001 to 2009"]

prefix_allowed_tokens_fn = get_prefix_allowed_tokens_fn(
    model,
    sentences,
    candidates_trie=Trie([
        model.encode(" }} [ {} ]".format(e))[1:].tolist()
        for e in ["George W. Bush", "George H. W. Bush"]
    ])
)
out = model.sample(
    sentences,
    prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
)
print(out)

Result: WRONG (+ possible bug...) 👎🏾

[[{'text': 'George { Bush } [ George H. W. Bush ] was the 43rd president of the United States from 2001 to 2009', 'logprob': tensor(-0.7810)}], [{'text': 'George { Bush } [ George H. W. Bush ] was the 43rd president of the United States from 2001 to { 2009 } [ and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and ...

but using

sentences = ["George Bush was the 41rd president of the United States from 1989 to 2003"]
entity_spans = get_entity_spans(
    wiki_model,
    sentences,
    mention_trie=Trie([
        model.encode(" {}".format(e))[1:].tolist()
        for e in ["George Bush"]
    ]),
    mention_to_candidates_dict={
        "George Bush": ["George W. Bush", "George H. W. Bush"]
    }
)
print(get_markdown(sentences, entity_spans)[0])

CORRECT Result: 👍🏾 🥇

[George Bush](https://en.wikipedia.org/wiki/George_H._W._Bush) was the 41rd president of the United States from 2001 to 2009

and

sentences = ["George Bush was the 43rd president of the United States from 2001 to 2009"]
entity_spans = get_entity_spans(
    wiki_model,
    sentences,
    mention_trie=Trie([
        model.encode(" {}".format(e))[1:].tolist()
        for e in ["George Bush"]
    ]),
    mention_to_candidates_dict={
        "George Bush": ["George W. Bush", "George H. W. Bush"]
    }
)
print(get_markdown(sentences, entity_spans)[0])

WRONG Result: 👎🏾

[George Bush](https://en.wikipedia.org/wiki/George_H._W._Bush) was the 43rd president of the United States from 2001 to 2009

I therefore approached with kilt wikipedia trie and hf_entity_disambiguation_aidayago model

with open(os.path.join(cache_dir,"kilt_titles_trie_dict.pkl"), "rb") as f:
    trie = Trie.load_from_dict(pickle.load(f))
dmodel = GENRE.from_pretrained(os.path.join(cache_dir,"hf_entity_disambiguation_aidayago")).eval()
sentences = ["[START_ENT] George Bush [END_ENT] was the 43rd president of United States"]
out = dmodel.sample(
    sentences,
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
print(out)

WRONG Result 👎🏾

[[{'text': 'George H. W. Bush', 'logprob': tensor(-0.0866)}], [{'text': 'George W. Bush', 'logprob': tensor(-0.7922)}], [{'text': 'George H. W. Bush vomiting incident', 'logprob': tensor(-1.6464)}], [{'text': 'George H. W. Bush 1988 presidential campaign', 'logprob': tensor(-1.6795)}], [{'text': 'George H. W. Bush Supreme Court candidates', 'logprob': tensor(-2.1032)}]]

and for the model hf_wikipage_retrieval:

smodel = GENRE.from_pretrained(os.path.join(cache_dir,"hf_wikipage_retrieval")).eval()
sentences = ["[START_ENT] George Bush [END_ENT] was the 43rd president of United States"]
out = smodel.sample(
    sentences,
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
print(out)

WRONG Result 👎🏾

[[{'text': 'George H. W. Bush', 'logprob': tensor(-0.0850)}], [{'text': 'George W. Bush', 'logprob': tensor(-0.7715)}], [{'text': 'George H. W. Bush Supreme Court candidates', 'logprob': tensor(-1.3834)}], [{'text': 'George H. W. Bush 1988 presidential campaign', 'logprob': tensor(-1.4211)}], [{'text': 'George H. W. Bush vomiting incident', 'logprob': tensor(-2.1070)}]]

So, to recap, I have got only two cases where the best prediction is the desired one:

sentences = ["George Bush was the 43rd president of United States"]
entity_spans = get_entity_spans(
    model,
    sentences,
    mention_trie=Trie([
        model.encode(" {}".format(e))[1:].tolist()
        for e in ["George Bush"]
    ]),
    mention_to_candidates_dict={
        "George Bush": ["George W. Bush"]
    }
)
print(get_markdown(sentences, entity_spans)[0])

Result: CORRECT 👍🏾 🥇

[George Bush](https://en.wikipedia.org/wiki/George_W._Bush) was the 43rd president of United States

and

sentences = ["George Bush was the 41rd president of the United States from 1989 to 2003"]
entity_spans = get_entity_spans(
    wiki_model,
    sentences,
    mention_trie=Trie([
        model.encode(" {}".format(e))[1:].tolist()
        for e in ["George Bush"]
    ]),
    mention_to_candidates_dict={
        "George Bush": ["George W. Bush", "George H. W. Bush"]
    }
)
print(get_markdown(sentences, entity_spans)[0])

CORRECT Result: 👍🏾 🥇

[George Bush](https://en.wikipedia.org/wiki/George_H._W._Bush) was the 41rd president of the United States from 2001 to 2009

Questions

  • Am I missing some other ways to disambiguate?
  • Assuming that both George W. Bush and George H. W. Bush are in the knowledge graph, does the model need more additional context, like other possible mention_to_candidates_dict or mention_trie values?
  • Supposing that one president (like George W. Bush) were missing, how would I add his Wikipedia entity page?
  • Are the "wrong" cases formally correct, or am I missing something?

Thanks a lot!

Transformers inference requires fairseq

According to the documentation, transformers inference should not require fairseq. However, it appears that the current version depends on it when loading the title trie (and importing it has possible version/environment issues).

Code to reproduce available on colab:
https://colab.research.google.com/drive/1Yo7pn-JhCaxGDs0lcGpbdpsgnd5qRY9z?usp=sharing

Error is below.

Executing:

import pickle

with open("/content/kilt_titles_trie.pkl", "rb") as f:
    trie = pickle.load(f)

def prefix_allowed_tokens_fn(batch_id, sent):
    return trie.get(sent.tolist())

Results in:

ImportError                               Traceback (most recent call last)
<ipython-input> in <module>()
      7
      8 with open("/content/kilt_titles_trie.pkl", "rb") as f:
----> 9     trie = pickle.load(f)
     10
     11 def prefix_allowed_tokens_fn(batch_id, sent):

1 frames
/content/GENRE/genre/base_model.py in <module>()
     10
     11 import torch
---> 12 from fairseq import search, utils
     13 from fairseq.models.bart import BARTHubInterface, BARTModel
     14 from omegaconf import open_dict

ImportError: cannot import name 'search'

Google colab
Python 3.7.9

Note: This follows up on issue #6, which found an incompatibility issue with versions for Python 3.6.9.

I suspect there are also related Colab environment issues with fairseq library versions, as various install methods don't seem to resolve the issue. Any ideas?

Question about BPE and constrained decoding

I have been working with BART and want to use the constrained decoding approach implemented in GENRE (congrats on the work, btw). However, I have a question that, looking at the code (perhaps due to my lack of fairseq skills), I am not able to clarify.

Due to the BPE tokenization of RoBERTa, and by extension BART, the same word may be tokenized differently depending on context; i.e., ["Nicola, ", "Nicola", "[ Nicola ]", " [ Nicola ],"] are tokenized quite differently ([[31988, 3019, 6, 1437], [31988, 3019], [10975, 14371, 27779], [646, 14371, 47720]]).

If we add tokens around the entities to mark them for training and perform entity linking in GENRE, that would mean that when decoding, the tokens may not be the same once the "[]{}" tokens are added, i.e., source and target show different tokens. Is there something I am missing to deal with this when using constrained decoding, which continues generation from the source tokens and may therefore differ from what is seen in the target text at training time?
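For reference, the tokenization differences above can be reproduced in a few lines (a minimal sketch using the Hugging Face BART tokenizer):

from transformers import BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")
for text in ["Nicola, ", "Nicola", "[ Nicola ]", " [ Nicola ],"]:
    # the same word receives different BPE splits depending on its context
    print(repr(text), tok(text, add_special_tokens=False)["input_ids"])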

Thanks!

Code of pre-train

Hi, thank you very much for the repo!
I want to reproduce your pre-training process (starting from BART); can you provide the code for the training process? If the code already exists in the repo, where can I find it?
I saw run_bart_slurm.py and run_training.sh described in /scripts_mgenre/README.md, but I did not see these two files.
Thanks again for your contribution!

'super' object has no attribute 'generate'... Issues following example...

Hey, I have been having a few issues getting the basic example to work on Colab (Python version 3.7.10).

If I install GENRE from the cloned repo and install fairseq from the branch you suggest, I can't import GENRE from genre.fairseq_model:

from genre.fairseq_model import GENRE

gives

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-8d627c96b455> in <module>()
      1 import pickle
----> 2 from genre.fairseq_model import GENRE
      3 from genre.entity_linking import get_end_to_end_prefix_allowed_tokens_fn_fairseq as get_prefix_allowed_tokens_fn
      4 from genre.utils import get_entity_spans_fairseq as get_entity_spans
      5 model = GENRE.from_pretrained("../models/fairseq_e2e_entity_linking_aidayago").eval()

2 frames
/content/gdrive/MyDrive/entity_linking_demo/fairseq/fairseq/criterions/__init__.py in <module>()
     22     CRITERION_DATACLASS_REGISTRY,
     23 ) = registry.setup_registry(
---> 24     "--criterion", base_class=FairseqCriterion, default="cross_entropy"
     25 )
     26 

TypeError: cannot unpack non-iterable NoneType object

So then I tried installing GENRE from the cloned repo with a plain !pip install fairseq, and I got further:

import pickle
import genre
from genre.trie import Trie
from genre.fairseq_model import GENRE
with open("kilt_titles_trie_dict.pkl", "rb") as f:
    trie = Trie.load_from_dict(pickle.load(f))
model = GENRE.from_pretrained("fairseq_entity_disambiguation_aidayago").eval()
model.sample(sentences=["Einstein was a [START_ENT] German [END_ENT] physicist."])

Which gives...

1042301B [00:00, 1105195.62B/s]
456318B [00:00, 601446.35B/s]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-f9cae36978b8> in <module>()
      6     trie = Trie.load_from_dict(pickle.load(f))
      7 model = GENRE.from_pretrained("fairseq_entity_disambiguation_aidayago").eval()
----> 8 model.sample(sentences=["Einstein was a [START_ENT] German [END_ENT] physicist."])

1 frames
/content/drive/MyDrive/entity_linking_demo/GENRE/genre/fairseq_model.py in sample(self, sentences, beam, verbose, text_to_id, marginalize, marginalize_lenpen, max_len_a, max_len_b, **kwargs)
     41             max_len_a=max_len_a,
     42             max_len_b=max_len_b,
---> 43             **kwargs,
     44         )
     45         outputs = [

/content/drive/MyDrive/entity_linking_demo/GENRE/genre/fairseq_model.py in generate(self, *args, **kwargs)
     90 
     91     def generate(self, *args, **kwargs) -> List[List[Dict[str, torch.Tensor]]]:
---> 92         return super(BARTHubInterface, self).generate(*args, **kwargs)
     93 
     94 

AttributeError: 'super' object has no attribute 'generate'

I'm probably doing something pretty stupid, but I'm just trying to follow the examples as I read them. I got the same AttributeError: 'super' object has no attribute 'generate' when I tried a couple of different models. Any advice?

List of Languages Supported?

Hello! I'm scanning the arXiv paper right now for it, and can't seem to figure out if there's a canonical list of languages served by this implementation (i.e., which languages the model can "read" in order to extract meaningful entities encoded between [START] and [END] tokens). Do you have the list of languages that the mGENRE model can parse?

get_entity_spans and trailing spaces

Hello,

thank you very much for the tool!

I noticed that .get_entity_spans sometimes fails (whereas .sample works well). It has something to do with trailing spaces.

  • Example 1:

    from genre.fairseq_model import GENRE
    from genre.entity_linking import get_end_to_end_prefix_allowed_tokens_fn_fairseq as get_prefix_allowed_tokens_fn
    from genre.utils import get_entity_spans_fairseq as get_entity_spans
    
    
    model = GENRE.from_pretrained("../models/fairseq_e2e_entity_linking_aidayago").eval()
    
    sentences = ['This transition consists of moving from an energy system mainly based on fossil fuels to an energy system based on low-carbon sources, especially renewable ones.']
    
    get_entity_spans(
        model,
        sentences)
    

    Output error (no "text" in the list):

    output_sentences = get_entity_spans_post_processing(
             [e[0]["text"] for e in output_sentences]
         )
    IndexError: list index out of range
    

If I modify this line and don't add trailing spaces, example 1 works.

  • Example 2 (less critical):

    # a sentence with a trailing space at the end
    sentences = ['The legal environment is adapting to keep up with the evolution of technologies and our societies (increased use of digital technology, growth of online commerce, etc.). ']
    
    get_entity_spans(
        model,
        sentences)
    

    Output error (sent is a list):

    in get_entity_spans_post_processing
        sent = re.sub(r"{.*?", "{ ", sent)
      File "/usr/lib/python3.8/re.py", line 210, in sub
        return _compile(pattern, flags).sub(repl, string, count)
    TypeError: expected string or bytes-like object
    

    this example produces a different error. If the trailing space is removed from the end of the sentence, the example works.

I use the fairseq version as described in the readme.

Missing requirements? Fairseq/omegaconf not optional; failure to load transformers version on colab

Hi,

I am trying to follow the example to use GENRE with transformers.
https://github.com/facebookresearch/GENRE/blob/main/examples/transformers.ipynb

I put this into google colab for testing.
https://colab.research.google.com/drive/1hG4yRbrIe2XOZN1F_IZv7Xm0qIac40dU?usp=sharing

The example does not include installing GENRE, which I presume is a prerequisite for running things.

The error I get is:
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input> in <module>()
      4
      5 with open("/content/kilt_titles_trie.pkl", "rb") as f:
----> 6     trie = pickle.load(f)
      7
      8 def prefix_allowed_tokens_fn(batch_id, sent):

1 frames
/content/GENRE/genre/base_model.py in <module>()
     10
     11 import torch
---> 12 from fairseq import search, utils
     13 from fairseq.models.bart import BARTHubInterface, BARTModel
     14 from omegaconf import open_dict

ModuleNotFoundError: No module named 'fairseq'

So I installed fairseq, but there appears to be a version dependency between fairseq and omegaconf?

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input> in <module>()
      4
      5 with open("/content/kilt_titles_trie.pkl", "rb") as f:
----> 6     trie = pickle.load(f)
      7
      8 def prefix_allowed_tokens_fn(batch_id, sent):

4 frames
/content/fairseq/fairseq/dataclass/configs.py in <module>()
     22 )
     23
---> 24 from omegaconf import II, MISSING
     25
     26

ModuleNotFoundError: No module named 'omegaconf'

I don't think it's intended that the transformers version should require this. Also, what's the issue between omegaconf and fairseq, which would presumably be a problem when using it from that side as well?

Question about the dataset 'BLINK'

Hi,

I want to train an entity disambiguation system, but I am not sure where I can get the 'BLINK' dataset. I cannot find a download link or instructions for building this pre-training dataset in the BLINK repository. And I found that the author wrote this in his paper:

"We use the May 2019 English Wikipedia dump which includes 5.9M entities, and use the hyperlinks in articles as examples (the anchor text is the mention). We use a subset of all Wikipedia linked mentions as our training data for the bi-encoder model (A total of 9M examples)."

Can I directly use KILT's 5.9M entities? And what is the rule for selecting "the subset of all Wikipedia linked mentions"?

Thanks!

Extracted link is too long and does not get closed, or the sentence is truncated too early.

Hi everyone,

I was working on a small experiment to perform E2E entity linking on some Wikipedia pages and noticed an interesting behaviour while playing with the beam size. I am attaching this Colab notebook, based on your HF example: https://colab.research.google.com/drive/1vy_lFCx2B0xvhYF09916df8UEBW5UENK#scrollTo=HLzGZ1_9v1EN . I tried feeding it a relatively long paragraph (the first paragraph of the Anarchism page on Wikipedia). The behaviour is interesting: for the first few links the model behaves correctly, but when certain links are met (in particular "hierarchy") the behaviour can change erratically, ending up generating a suspiciously long wiki link that most likely does not exist and is instead the content of another page. When the beam size is too small, the model simply stops generating quite early, even though there is much more text left to process.

About the behaviour of GENRE with the given long example I had 2 questions:

  1. Why do you think that in certain cases it gives up with a long list of PAD tokens? Why does it not try to complete the whole page?
  2. Why does the model report text from other Wikipedia pages instead of terminating the links? Might this be an error in the trie? I need to explore the trie structure a bit more to tell whether entire pages have been erroneously spilled into the trie, and if so, whether we could regenerate it. It is interesting that in some examples the closing "]" token is absent. Does the trie enforce the generation of the "]" token, or is the underlying BART model trying to do that? Might this be related to #9?

I think the function get_end_to_end_prefix_allowed_tokens_fn_hf has some issues in the token selection. Hopefully I'll find some time later on to properly investigate the trie and figure out what is going on.

I'll also have a try with fairseq to check if the behaviour is reproducible.

get markdown for disambiguation model

Is there a way to add markdown with links to Wikipedia pages to the entities when using the disambiguation model?

The example in the README with fairseq_entity_disambiguation_aidayago or hf_entity_disambiguation_aidayago is this:

sentences = ["Einstein was a [START_ENT] German [END_ENT] physicist."]

model.sample(
    sentences,
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)

Meanwhile, I see that when using fairseq_e2e_entity_linking_aidayago or hf_e2e_entity_linking_aidayago there is the function get_markdown, which needs entity_spans.

from genre.utils import get_markdown
from IPython.display import Markdown

entity_spans = get_entity_spans(
    model,
    sentences,
    mention_trie=Trie([
        model.encode(" {}".format(e))[1:].tolist()
        for e in ["Einstein", "Nobel Prize"]
    ]),
    mention_to_candidates_dict={
        "Einstein": ["Albert Einstein", "Einstein (surname)"],
        "Nobel Prize": ["Nobel Prize in Physics", "Nobel Prize in Medicine"],
    }
)

Markdown(get_markdown(sentences, entity_spans)[0])

In 1921, Einstein received a Nobel Prize.

How can I do the same with the disambiguation model? I tried to re-build the trie, but it didn't work for me.
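For what it's worth, since entity_spans are just per-sentence (start, length, title) tuples (as shown in other issues on this page), one hedged workaround is to build them by hand from the disambiguation output, given that the [START_ENT]/[END_ENT] span is already known; the sentence and offsets below are illustrative:

from genre.utils import get_markdown

sentences = ["Einstein was a German physicist."]
# (start, length, predicted_title) for the marked span "German"
entity_spans = [[(15, 6, "Germany")]]
print(get_markdown(sentences, entity_spans)[0])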

How to threshold by probability (end-to-end model)?

Hello,
First of all, thank you for a fantastic code release!

I'm wondering how to threshold the results of the end-to-end GENRE model. In the disambiguation model, since a probability is given for the single entity being disambiguated, I'm able to get good filtering with a single confidence value, e.g., 75% probability (by exponentiating the logprob).

But in the case of the end-to-end model, I'm not sure how we should filter by confidence. Would it be sufficient to normalize by the number of entities found? Thanks!
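One possible heuristic (an assumption, not something the library provides) is to normalize the sequence log-probability by the number of linked entities before exponentiating:

import math

def e2e_confidence(logprob: float, num_entities: int) -> float:
    # hypothetical per-entity confidence: geometric mean of the
    # entity probabilities implied by the sequence log-probability
    return math.exp(logprob / max(num_entities, 1))

print(e2e_confidence(-0.6, 2))  # ~0.74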

Details for training

Hi

Having looked through the scripts and documentation, it's unclear whether all the steps and necessary files for training the various models are included. Could you let us know whether everything required has been made available, and if not, whether you plan to release it at some point? So far I think I've found enough to try training the disambiguation model, but we are more interested in the E2E models.

Thanks

Tony

4-point gap between my fine-tuned model and the shared model

I am doing the entity disambiguation task.
The model you published, fairseq_entity_disambiguation_aidayago, gets this result (without candidates and without the trie built specially for AIDA). This is almost the same as issue #26, but the aida-test-kilt result is a little different: #26 reports 87.89, here it is 87.92.

[screenshot of the shared model's evaluation results]

With the same code, my fine-tuned model gets this result (using the same wiki trie, kilt_titles_trie_dict.pkl). For the test dataset aida-test-kilt, there is a ~4 point gap.

[screenshot of the fine-tuned model's evaluation results]

Although fairseq provides the default seed=1, each time I rerun train.sh the resulting model is different and has different test results, but most of them are similar to the picture above, where there is always a ~4 point gap compared with the shared model.

Here is my finetune shell code.

  • Using the shared model fairseq_entity_disambiguation_blink as the pretrained model.
  • Based on /GENRE/tree/main/scripts_genre/train.sh, I just changed the file paths and set --max-update 10000 and --total-num-update 10000, so the number of fine-tuning steps is the same as mentioned in the paper (10k).
  • Removed --reset-meters and --reset-optimizer, so the parameters are initialized from fairseq_entity_disambiguation_blink and the optimizer is unchanged.
  • The code saves 3 checkpoints: 50, 51, and checkpoint_last. I then test each of them, and the best result is still around 83% for aida-test-kilt. dict.source.txt and dict.target.txt are the same as in fairseq_entity_disambiguation_blink, which just renames the dict.txt file from the bart.large model (I just copied the two files from fairseq_entity_disambiguation_blink to the fine-tuned model).
DATASET=GENRE-main/datasets/aida
BASED_MODEL=fairseq_entity_disambiguation_blink
NAME=fairseq_aida_basedon_blink_10k_default
STEP=10000

fairseq-train $DATASET/bin/ \
    --save-dir GENRE-main/models/$NAME \
    --tensorboard-logdir tensorboard_logs/$NAME \
    --restore-file GENRE-main/models/$BASED_MODEL/model.pt \
    --arch bart_large  \
    --task translation  \
    --criterion label_smoothed_cross_entropy  \
    --source-lang source  \
    --target-lang target  \
    --truncate-source  \
    --label-smoothing 0.1  \
    --max-tokens 1024  \
    --update-freq 1  \
    --max-update $STEP  \
    --required-batch-size-multiple 1  \
    --dropout 0.1  \
    --attention-dropout 0.1  \
    --relu-dropout 0.0  \
    --weight-decay 0.01  \
    --optimizer adam  \
    --adam-betas "(0.9, 0.999)"  \
    --adam-eps 1e-08  \
    --clip-norm 0.1  \
    --lr-scheduler polynomial_decay  \
    --lr 3e-05  \
    --total-num-update $STEP  \
    --warmup-updates 500  \
    --ddp-backend no_c10d  \
    --num-workers 20  \ 
    --share-all-embeddings \
    --layernorm-embedding \
    --share-decoder-input-output-embed  \
    --skip-invalid-size-inputs-valid-test  \
    --log-format json  \
    --log-interval 10  \
    --patience 200  \

Here is my running environment: pytorch-1.6.0, cuda10cudnn7, python 3.7.7.
I want to know if something is wrong with my fine-tuning procedure, since 83 to 87 is a really big gap.

Thanks for your work again~

Runtime error when running mgenre example

Hey,

I ran the commands as-is in the example_mgenre script; however, I am getting the following error when running model.sample:

model.sample(
    sentences,
    prefix_allowed_tokens_fn=lambda batch_id, sent: [
        e for e in trie.get(sent.tolist()) if e < len(model.task.target_dictionary)
    ],
)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr2/home//GENRE/genre/fairseq_model.py", line 38, in sample
    batched_hypos = self.generate(
  File "/usr2/home//GENRE/genre/fairseq_model.py", line 87, in generate
    def generate(self, *args, **kwargs) -> List[List[Dict[str, torch.Tensor]]]:
  File "/usr2/home//fairseq/fairseq/hub_utils.py", line 179, in generate
    generator, self.models, batch, **inference_step_args
TypeError: inference_step() argument after ** must be a mapping, not int

This is the fairseq version I installed (https://github.com/nicola-decao/fairseq/tree/fixing_prefix_allowed_tokens_fn)

error in mgenre model.sample

Hello,
I was trying to do entity linking on some sentences from different languages. I got this error:

code:

sentences= ['በምርጫው እንደማይወዳደሩ ቀደም ሲል ካስታወቁ በኋላ ፓርቲው በመጨረሻ ባካሄደው ጉባዔው [START] ዦዋዎ ሉሬንቾን [END] ቀዳሚው እጩ አድርጎ ሰይሟል ።']

 model.sample(
    sentences,
    prefix_allowed_tokens_fn=lambda batch_id, sent: [
        e for e in trie.get(sent.tolist()) if e < len(model.task.target_dictionary)
    ],
    text_to_id=lambda x: max(lang_title2wikidataID[tuple(reversed(x.split(" >> ")))], key=lambda y: int(y[1:])),
    marginalize=True,
)

error:

Traceback (most recent call last):
  File "temp.py", line 46, in <module>
    x = model.sample(
  File "/projects/antonis/fahim/ner_linking/GENRE/genre/fairseq_model.py", line 37, in sample
    batched_hypos = self.generate(
  File "/projects/antonis/fahim/ner_linking/GENRE/genre/fairseq_model.py", line 92, in generate
    return super(BARTHubInterface, self).generate(*args, **kwargs)
  File "/projects/antonis/fahim/ner_linking/fairseq/fairseq/hub_utils.py", line 178, in generate
    translations = self.task.inference_step(
  File "/projects/antonis/fahim/ner_linking/fairseq/fairseq/tasks/fairseq_task.py", line 501, in inference_step
    return generator.generate(
  File "/home/ffaisal/fairseq/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/projects/antonis/fahim/ner_linking/fairseq/fairseq/sequence_generator.py", line 886, in generate
    finalized = super()._generate(sample, **kwargs)
  File "/projects/antonis/fahim/ner_linking/fairseq/fairseq/sequence_generator.py", line 242, in _generate
    encoder_outs = self.model.forward_encoder(net_input)
  File "/projects/antonis/fahim/ner_linking/fairseq/fairseq/sequence_generator.py", line 757, in forward_encoder
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "/projects/antonis/fahim/ner_linking/fairseq/fairseq/sequence_generator.py", line 757, in <listcomp>
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "/projects/antonis/fahim/ner_linking/fairseq/fairseq/models/fairseq_encoder.py", line 55, in forward_torchscript
    return self.forward_non_torchscript(net_input)
  File "/projects/antonis/fahim/ner_linking/fairseq/fairseq/models/fairseq_encoder.py", line 62, in forward_non_torchscript
    return self.forward(**encoder_input)
  File "/projects/antonis/fahim/ner_linking/fairseq/fairseq/models/transformer.py", line 437, in forward
    return self.forward_scriptable(src_tokens,
  File "/projects/antonis/fahim/ner_linking/fairseq/fairseq/models/transformer.py", line 480, in forward_scriptable
    x, encoder_embedding = self.forward_embedding(src_tokens, token_embeddings)
  File "/projects/antonis/fahim/ner_linking/fairseq/fairseq/models/transformer.py", line 396, in forward_embedding
    token_embedding = self.embed_tokens(src_tokens)
  File "/home/ffaisal/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ffaisal/fairseq/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/home/ffaisal/fairseq/lib/python3.8/site-packages/torch/nn/functional.py", line 2043, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Kindly let me know if anyone has any idea how to solve this. This code works fine for some lists of sentences in certain languages, while in cases like this one it shows this error. Thanks

HuggingFace Model Differences

Hi

After spending some time comparing the outputs of the fairseq (FS) and Hugging Face (HF) models, a couple of things have come to light. Probably the most significant is that the HF model config has a parameter whose default value significantly impacts the output: "no_repeat_ngram_size" is set to 3, which makes the model avoid repetitions, and it should ideally be set to zero (which is comparable to the FS setup). Without this, HF can produce bad/invalid output (some of which can cause RuntimeExceptions to be thrown in the decoder). The parameter can be supplied in the model.generate call (no_repeat_ngram_size=0); the utils methods would need to be modified to handle this as well.

Another difference is the minimum (not so much of an issue) and maximum generation length limits: by default, HF uses 20 in the method but 62 in the model config, rather than the 200 used in FS. For the given examples you can override this when calling sample, but for things like annotation span generation it needs to be set in the utils functions.
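A sketch of overriding both parameters in a plain Hugging Face generate call (assuming a HF GENRE checkpoint already loaded as model with its tokenizer; the variable names are illustrative):

# hypothetical override to match the fairseq setup
outputs = model.generate(
    **tokenizer(sentences, return_tensors="pt"),
    num_beams=5,
    max_length=200,          # FS uses 200; the HF config defaults to 62
    no_repeat_ngram_size=0,  # disable repetition blocking (HF config sets 3)
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))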

One difference that remains is the handling of whitespace before a comma between the encoder/decoder pairs of the two models: FS produces sequences with a space before commas, while HF has no space.

On a side note, anyone using HF may need to implement their own batching to decode multiple sentences (if using the get_entity_spans_hf type methods, this could be implemented within them as well), and to truncate input sentence lengths to avoid generating too long an input sequence.

Hopefully that will let people use either setup with nearly comparable results.

Thanks

Tony

RuntimeError when using get_entity_spans

This error comes up in an unpredictable way when using get_entity_spans:

# entity wikipedia link
entity_spans = get_entity_spans(
    model,
    sentences,
    mention_trie=Trie([
        model.encode(" {}".format(e))[1:].tolist()
        for e in ["Einstein", "Nobel Prize"]
    ]),
    mention_to_candidates_dict={
        "Einstein": ["Albert Einstein", "Einstein (surname)"],
        "Nobel Prize": ["Nobel Prize in Physics", "Nobel Prize in Medicine"],
    }
)
print(get_markdown(sentences, entity_spans)[0])
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-48-2c88416e3c2b> in <module>
      9     mention_to_candidates_dict={
     10         "Einstein": ["Albert Einstein", "Einstein (surname)"],
---> 11         "Nobel Prize": ["Nobel Prize in Physics", "Nobel Prize in Medicine"],
     12     }
     13 )

~/SageMaker/hf-experiments/src/genre/genre/utils.py in get_entity_spans_hf(model, input_sentences, mention_trie, candidates_trie, mention_to_candidates_dict, redirections)
    197 
    198     return get_entity_spans_finalize(
--> 199         input_sentences, output_sentences, redirections=redirections
    200     )
    201 

~/SageMaker/hf-experiments/src/genre/genre/utils.py in get_entity_spans_finalize(input_sentences, output_sentences, redirections)
    228                     status = "m"
    229                 else:
--> 230                     raise RuntimeError
    231 
    232             elif status == "m":

RuntimeError: 

Normally I would get the usual output:

In 1921, [Einstein](https://en.wikipedia.org/wiki/Albert_Einstein) received a Nobel Prize.

Start and end positions of tokens

Thanks for your interesting work.
AFAIK, it is necessary to specify the start and end tokens in input sentences, and only one tag is possible per sentence at a time.
So, if we want to use it to annotate the content of a webpage, it is necessary to specify the words first, right?
Could you please explain what get_entity_spans does?
#20
Is it responsible for detecting the tags and their start and end positions?
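For reference, elsewhere on this page get_entity_spans is shown returning, per input sentence, a list of (start, length, entity_title) tuples, e.g. [[(9, 8, 'Albert_Einstein')]], so it both detects spans and links them in a single call.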

'GENREHubInterface' object has no attribute '_build_batches'

While inferring the output for end-to-end entity linking with the fairseq_e2e_entity_linking_aidayago model, this error message appears.

code:

from genre.fairseq_model import GENRE
from genre.entity_linking import get_end_to_end_prefix_allowed_tokens_fn_fairseq as get_prefix_allowed_tokens_fn
from genre.utils import get_entity_spans_fairseq as get_entity_spans

model = GENRE.from_pretrained("/content/GENRE/fairseq_e2e_entity_linking_aidayago").eval()

sentences = ["In 1921, Einstein received a Nobel Prize."]

prefix_allowed_tokens_fn = get_prefix_allowed_tokens_fn(model, sentences)

model.sample(
    sentences,
    prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
)

error msg -> AttributeError: 'GENREHubInterface' object has no attribute '_build_batches'
error:

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
      5 model.sample(
      6     sentences,
----> 7     prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
      8 )

2 frames
/usr/local/lib/python3.7/dist-packages/genre/fairseq_model.py in sample(self, sentences, beam, verbose, **kwargs)
     24             return self.sample([sentences], beam=beam, verbose=verbose, **kwargs)[0]
     25         tokenized_sentences = [self.encode(sentence) for sentence in sentences]
---> 26         batched_hypos = self.generate(tokenized_sentences, beam, verbose, **kwargs)
     27         return [
     28             [

/usr/local/lib/python3.7/dist-packages/genre/fairseq_model.py in generate(self, tokenized_sentences, inference_step_args, skip_invalid_size_inputs, *args, **kwargs)
     45             raise NotImplementedError("prefix generation not implemented for BART")
     46         res = []
---> 47         for batch in self._build_batches(tokenized_sentences, skip_invalid_size_inputs):
     48             src_tokens = batch["net_input"]["src_tokens"]
     49             results = super(BARTHubInterface, self).generate(

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in __getattr__(self, name)
    946             return modules[name]
    947         raise AttributeError("'{}' object has no attribute '{}'".format(
--> 948             type(self).__name__, name))
    949
    950     def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None:

AttributeError: 'GENREHubInterface' object has no attribute '_build_batches'

'NoneType' object has no attribute 'bpe'

Hello,
Could you please guide me on how to solve the following error when running the following line:
model = mGENRE.from_pretrained("/scratch/c7031329/data/GENRE-main/dl/fairseq_multilingual_entity_disambiguation").eval()

AttributeError: 'NoneType' object has no attribute 'bpe'

Sorry if it seems to be a simple issue!

Re-create the trie tree

Goal: I am trying to create the kilt_titles_trie_dict.pkl file by myself, so that, once I can perfectly re-create the KILT trie, I can create tries for other data.

We also release the BPE prefix tree (trie) from KILT Wikipedia titles (kilt_titles_trie_dict.pkl) that is based on the 2019/08/01 Wikipedia dump, downloadable in its raw format here. The trie contains ~5M titles and it is used to generate entities for all the KILT experiments.

Data preparation: I downloaded the raw-format Wikipedia dump from the link provided in GENRE/examples_genre/README.md, unzipped the downloaded file, and got enwiki-pages-articles.xml. I used the wikiextractor tool to parse the .xml file and obtained data in this format:

<doc id="10" url="https://en.wikipedia.org/wiki?curid=10" title="AccessibleComputing">
AccessibleComputing



</doc>
<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is an anti-authoritarian political philosophy that rejects hierarchies deemed unjust and advocates their replacement with self-managed, self-governed societies based on voluntary, cooperative institutions. These institutions are often described as stateless societies, although several authors have defined them more specifically as distinct institutions based on non-hierarchical or free associations. Anarchism's central disagreement with other ideologies is that it holds the state to be undesirable, unnecessary, and harmful.
...

</doc>
  • The first line, <doc id="10" url="https://en.wikipedia.org/wiki?curid=10" title="AccessibleComputing">: the title is the target entity.

  • The fourth line, "Anarchism is an ...", is a description of the entity.

I extract these two with the code below as a dict {title: description}:

import os
import pickle
import re
from multiprocessing import Pool, Manager

manager = Manager()
dict_title_description = manager.dict()

def process_a_file(in_path):
    print(in_path)
    with open(in_path, 'r', encoding='utf8') as r_f:
        lines = r_f.readlines()
    i = 0
    while i < len(lines):
        tmp_line = lines[i]
        if tmp_line.startswith('<doc id='):
            title = re.findall(r'<doc id=.*title="(.*)"', tmp_line)[0]
            try:
                description = lines[i + 3].rstrip('\n')
            except IndexError:
                break
            dict_title_description.update({title: description})
            i += 6
        else:
            i += 1
    return dict_title_description

if __name__ == '__main__':
    in_path = "wikiextractor-master/wikiextractor/enwiki_data/AA"
    out_path_dict = "wikiextractor-master/wikiextractor/enwiki_data/dict.pkl"
    out_path_list = "wikiextractor-master/wikiextractor/enwiki_data/title_list.pkl"
    final_dict = {}
    path_list = []  # fix: path_list is used below but was never initialized

    pool = Pool(processes=14)

    for root, dirnames, filenames in os.walk(in_path):
        for fname in filenames:
            path_list.append(in_path + '/' + fname)
    pool.map_async(process_a_file, path_list)
    pool.close()
    pool.join()
    final_dict = dict_title_description
    list_title = list(final_dict.keys())
    list_title.sort()

    with open(out_path_dict, 'wb') as w_f:
        pickle.dump(final_dict, w_f)
    with open(out_path_list, 'wb') as w_f:
        pickle.dump(list_title, w_f)

I use list_title (the keys of the dict) to generate the trie, but the list_title length is 14608727, which is not the same as mentioned in GENRE/examples_genre/README.md (~5M titles).

Trie creation: with the title list, I use the code below to create my wiki trie and get the file our_kilt_titles_trie_dict.pkl:

from genre.fairseq_model import GENRE
from genre.trie import Trie
import pickle

model = (
        GENRE.from_pretrained("../models/fairseq_entity_disambiguation_aidayago", checkpoint_file='model.pt')
        .eval()
        .to('cpu')
    )

# entities = ['AccessibleComputing', 'Anarchism']
with open('wikiextractor-master/wikiextractor/enwiki_data/title_list.pkl', 'rb') as r_f:
    entities = pickle.load(r_f)
print(len(entities))
trie = Trie([2]+model.encode(entity)[1:].tolist() for entity in entities).trie_dict
with open('our_kilt_titles_trie_dict.pkl', 'wb') as w_f:
    pickle.dump(trie, w_f)
print("finish running!")

I used kilt_titles_trie_dict.pkl and our_kilt_titles_trie_dict.pkl to test the same model, and they give different results. fairseq_blink_200k_default_no_reset is a model pretrained on BLINK data with kilt_titles_trie_dict.pkl. Testing with the two different tries, there is a 2-point gap.
[screenshots of the two evaluation results]

Question: I want to know whether you also did some filtering when creating kilt_titles_trie_dict.pkl (I know you did some filtering when creating the special trie for AIDA, as I read in issue #37), since neither the title count nor the results match.

Questions about the generation of " [" and " ]".

In evaluation mode, I was confused about how the model generates the tokens " [" and " ]". I have read the code in entity_linking.py; in the function _get_end_to_end_prefix_allowed_tokens_fn(), I didn't see any code related to the generation of " [" or " ]".

In addition, the function get_trie_entity in entity_linking.py returns the candidate tokens constrained by the entities trie.

return candidates_trie_tmp.get(sent[pointer_end:])

In this line, sent[pointer_end:] will point to a sequence starting with token id 35524, which is ' }'. So using the trie loaded from kilt_titles_trie.pkl will always return [], because kilt_titles_trie.pkl starts with token id 2 (</s>).
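A toy illustration of that behaviour (a sketch using the Trie API shown elsewhere on this page; the token ids besides 2 are made up):

from genre.trie import Trie

# two stored sequences, both starting with token id 2 (</s>)
toy = Trie([[2, 100, 101], [2, 100, 102]])
print(toy.get([2, 100]))  # -> [101, 102]
print(toy.get([35524]))   # -> [] : no stored sequence starts with ' }'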

I'm not sure if this is the right understanding; I look forward to your reply.

'GENREHubInterface' object has no attribute 'cfg'

When trying to run the notebook fairseq.ipynb (Entity Disambiguation):

import pickle

with open("data/kilt_titles_trie.pkl", "rb") as f:
    trie = pickle.load(f)

def prefix_allowed_tokens_fn(batch_id, sent):
    return trie.get(sent.tolist())

from genre import GENRE
model = GENRE.from_pretrained("models/fairseq_entity_disambiguation_aidayago").eval()

sentences = ["[START_ENT] Armstrong [END_ENT] was the first man on the Moon."]

model.sample(
    sentences,
    prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
)

I have the following error:

---------------------------------------------------------------------------
ModuleAttributeError                      Traceback (most recent call last)
<ipython-input-3-100e2b2ca1be> in <module>
      3 model.sample(
      4     sentences,
----> 5     prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
      6 )

~/Documents/Sources/GENRE/genre/base_model.py in sample(self, sentences, beam, verbose, **kwargs)
    126             return self.sample([sentences], beam=beam, verbose=verbose, **kwargs)[0]
    127         tokenized_sentences = [self.encode(sentence) for sentence in sentences]
--> 128         batched_hypos = self.generate(tokenized_sentences, beam, verbose, **kwargs)
    129         return [
    130             [

~/Documents/Sources/GENRE/genre/base_model.py in generate(self, tokenized_sentences, beam, verbose, skip_invalid_size_inputs, inference_step_args, **kwargs)
    151 
    152         # build generator using current args as well as any kwargs
--> 153         gen_args = copy.copy(self.cfg)
    154         with open_dict(gen_args):
    155             gen_args.beam = beam

~/miniconda3/envs/py37_genre/lib/python3.7/site-packages/torch/nn/modules/module.py in __getattr__(self, name)
    777                 return modules[name]
    778         raise ModuleAttributeError("'{}' object has no attribute '{}'".format(
--> 779             type(self).__name__, name))
    780 
    781     def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None:

ModuleAttributeError: 'GENREHubInterface' object has no attribute 'cfg'

issue in huggingface prefix_allowed_tokens_fn

Hello,

I tried to use constrained beam search with huggingface and realized that @nicola-decao has added this functionality via prefix_allowed_tokens_fn in huggingface generation. However, I am occasionally getting an error where a token outside the constraint is generated.
For example, given the constraint {2: [3, 6, 47], 3: [6], 6: [47], 47: [3]} and [2, 6] as input_ids, I get a number other than 47, which is the only possible output under this constraint. Is there any way I can solve this, or is there anything I'm missing?
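For context, a minimal sketch of the dict-based constraint function described above (the names are illustrative):

# transition table: last generated token id -> allowed next token ids
constraint = {2: [3, 6, 47], 3: [6], 6: [47], 47: [3]}

def prefix_allowed_tokens_fn(batch_id, input_ids):
    # with input_ids ending in 6, only token 47 should be allowed
    return constraint.get(input_ids[-1].item(), [])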

Thanks!

Evaluation on entity linking and disambiguation

Hi,

Very inspiring work! Thanks for sharing the code. I am trying to do an error analysis on the entity linking results. I was wondering if you could provide a script for evaluating the released model on the entity disambiguation/linking datasets? Thank you.

Why do the same words in different sentences have different entity linking results?

Sentence 1: "Orthostatic hypotension was ameliorated 4 days after withdrawal of selegiline and totally abolished 7 days after discontinuation of the drug ."
Entity linking results: [[0, 23, "Orthostatic_hypotension"], [67, 10, "Selegiline"], [136, 4, "Drug"]]

Sentence 2: "A lesser degree of orthostatic hypotension occurred with standing ."
Entity linking results: [[57, 8, "Standing"], [69, 0, "__A"]]

Why do the words "orthostatic hypotension" in sentence 2 link to nothing?

Tagging multiple spans at a time

Hi,

Thank you for your work, really interesting!
I'm using the mGENRE model (I'm following these instructions) and it works fine.
From the examples, it seems that it is possible to tag only one text span (per sentence) at a time.
I was wondering if it is possible to tag multiple spans at a time, e.g., [START] Einstein [END] was a German [START] physicist [END]

Thanks!

About the entity disambiguation performance without candidate set

Hi,

Thanks for your work.

I ran the experiment of evaluating entity disambiguation performance without a candidate set.
As shown in the paper, the performance should be:
[screenshot of the results table from the paper]

However, when I run entity disambiguation without a candidate set using the provided checkpoint:
python evaluate_kilt_dataset.py path_to/fairseq_entity_disambiguation_aidayago path_to/datasets path_to/predictions --trie path_to/kilt_titles_trie_dict.pkl --batch_size 64 --device "cuda:0"
It gives the performance:

[screenshot of the obtained results]

Is there anything wrong with my run?

Settings for evaluation

Hi! Thank you for your great work.

I have two questions related to evaluation.

  1. Could you share the setting details you used when evaluating GENRE? Do you have any plans to share the evaluation environment (including the middleware for GERBIL, which is used to communicate with its platform)?
  2. Is there a setting that allows the GPU to be used for inference? (By default, the GPU is not used for inference, right?)

Details:
To confirm that my GENRE settings are the best-performing ones, I have been trying to reproduce the end-to-end entity linking results, but could not.

I followed this example, and evaluated using the GERBIL platform.

To extract the spans, I used _get_entity_spans function.

I used the KORE50 dataset for evaluation, but could not get good results (for results, see here).

Maybe there is a problem with my GENRE settings, pre-processing or post-processing. I would appreciate it if you could give me more details.

Best practice to link already known entities

I'm executing some of the examples on this page with the hf_e2e_entity_linking_aidayago model. I have already found the entities, so I only want to link them to Wikipedia given a list of hints. I have encountered some problems:

  • When I put more than one mention in mention_trie, I don't see all the results; for example, in this case:
from genre.entity_linking import get_end_to_end_prefix_allowed_tokens_fn_hf as get_prefix_allowed_tokens_fn
from genre.utils import get_entity_spans_hf as get_entity_spans
model = GENRE.from_pretrained("../models/hf_e2e_entity_linking_aidayago").eval()

sentences = ["In 1921, Einstein received a Nobel Prize."]

get_entity_spans(
    model,
    sentences,
    mention_trie=Trie([
        model.encode(" {}".format(e))[1:].tolist()
        for e in ["Einstein", "Nobel Prize"]
    ]),
    mention_to_candidates_dict={
        "Einstein": ["Albert Einstein", "Einstein (surname)"],
        "Nobel Prize": ["Nobel Prize in Physics", "Nobel Prize in Medicine"],
    }
)

My output is:

[[(9, 8, 'Albert_Einstein')]]

If I remove "Einstein" from both mention_trie and mention_to_candidates_dict, the result is:

[[(29, 11, 'Nobel_Prize_in_Physics')]]

But in the example shown in the README, both entities should appear.

  • Moreover, I'm encountering some problems (maybe a bug) in the definition of the entities to search for, as here:
sentences = ["George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009. A member of the Republican Party, Bush previously served as the 46th governor of Texas from 1995 to 2000. He was born into the Bush family; his father, George H. W. Bush, was the 41st president of the United States from 1989 to 1993."]

get_entity_spans(
    model,
    sentences,
    mention_trie=Trie([
        model.encode(" {}".format(e))[1:].tolist()
        for e in ["George Walker Bush"]
    ]),
    mention_to_candidates_dict={
        "George Walker Bush": ["George W. Bush", "George H. W. Bush"],
    }
)

The result is:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-12-2ae034f807cd> in <module>
      1 sentences = ["George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009. A member of the Republican Party, Bush previously served as the 46th governor of Texas from 1995 to 2000. He was born into the Bush family; his father, George H. W. Bush, was the 41st president of the United States from 1989 to 1993."]
      2 
----> 3 get_entity_spans(
      4     model,
      5     sentences,

~/genre/genre/utils.py in get_entity_spans_hf(model, input_sentences, mention_trie, candidates_trie, mention_to_candidates_dict, redirections)
    176     redirections=None,
    177 ):
--> 178     return _get_entity_spans(
    179         model,
    180         input_sentences,

~/genre/utils.py in _get_entity_spans(model, input_sentences, prefix_allowed_tokens_fn, redirections)
    141     )
    142 
--> 143     return get_entity_spans_finalize(
    144         input_sentences, output_sentences, redirections=redirections
    145     )

~/genre/utils.py in get_entity_spans_finalize(input_sentences, output_sentences, redirections)
    218                     status = "m"
    219                 else:
--> 220                     raise RuntimeError
    221 
    222             elif status == "m":

RuntimeError: 

So my question is:
if I already have the entities and want to link them to Wikipedia (I also have candidates for the Wikipedia pages, so I only need to disambiguate them), which is the best function (get_prefix_allowed_tokens_fn, get_entity_spans, ...?), and how should I declare my entities and candidates to avoid these problems?

Thank you! It seems a very promising work! 😊
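For what it's worth, one option is to skip end-to-end linking entirely and treat this as entity disambiguation: mark each known mention with [START_ENT]/[END_ENT] and constrain generation to that mention's candidate titles. A hedged sketch (the model name follows the repo's naming; the trie construction copies the pattern from the snippet above and may need an offset tweak; the candidate list is illustrative):

from genre.hf_model import GENRE
from genre.trie import Trie

model = GENRE.from_pretrained("../models/hf_entity_disambiguation_aidayago").eval()

sentences = ["In 1921, [START_ENT] Einstein [END_ENT] received a Nobel Prize."]

# restrict generation to this mention's candidate Wikipedia titles
candidates_trie = Trie([
    model.encode(" {}".format(c))[1:].tolist()
    for c in ["Albert Einstein", "Einstein (surname)"]
])

model.sample(
    sentences,
    prefix_allowed_tokens_fn=lambda batch_id, sent: candidates_trie.get(sent.tolist()),
)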

Can't run the example code

Hello, thank you for sharing the code!

I ran into some issues when running GENRE's example inference code. I followed the instructions for installing the dev version of fairseq.

Any ideas why? Thanks.

Traceback (most recent call last):
File "/disk/luqh/multilingualEL/fairseq/fairseq/data/data_utils.py", line 302, in batch_by_size
from fairseq.data.data_utils_fast import (
File "fairseq/data/data_utils_fast.pyx", line 1, in init fairseq.data.data_utils_fast
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "test.py", line 19, in
prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
File "/disk/luqh/multilingualEL/GENRE/genre/fairseq_model.py", line 43, in sample
**kwargs,
File "/disk/luqh/multilingualEL/GENRE/genre/fairseq_model.py", line 92, in generate
return super(BARTHubInterface, self).generate(*args, **kwargs)
File "/disk/luqh/multilingualEL/fairseq/fairseq/hub_utils.py", line 176, in generate
for batch in self._build_batches(tokenized_sentences, skip_invalid_size_inputs):
File "/disk/luqh/multilingualEL/fairseq/fairseq/hub_utils.py", line 269, in _build_batches
disable_iterator_cache=True,
File "/disk/luqh/multilingualEL/fairseq/fairseq/tasks/fairseq_task.py", line 289, in get_batch_iterator
required_batch_size_multiple=required_batch_size_multiple,
File "/disk/luqh/multilingualEL/fairseq/fairseq/data/fairseq_dataset.py", line 152, in batch_by_size
fixed_shapes=fixed_shapes,
File "/disk/luqh/multilingualEL/fairseq/fairseq/data/data_utils.py", line 314, in batch_by_size
"Please build (or rebuild) Cython components with: pip install " ValueError: Please build (or rebuild) Cython components with: pip install --editable .orpython setup.py build_ext --inplace`.

Problem in reproducing entity linking results

Hello, thank you for the very interesting idea and work!

However, I have a few questions about the reproduction of the results of end-to-end entity linking, as I failed to obtain comparable results:

  1. Could you share the details about the functions used in generating the results and corresponding inputs/settings?
  2. Could you share the entity universe and mention-to-candidate mappings from the paper End-to-End Neural Entity Linking, as mentioned in issue #30? Could you also detail how these are used in get_prefix_allowed_tokens_fn? (i.e., what is used for the mention_trie and what for the mention-to-candidates mapping?)
  3. If it is possible, could you please share the code that runs the experiments on the GERBIL platform?

Thank you very much!

Below are the details about how I ran my experiments:

Details:
I used the fairseq_e2e_entity_linking_aidayago model and evaluated it on the GERBIL platform. I pre-processed the sentences with get_entity_spans_pre_processing, generated predictions with model.sample, post-processed the predictions with get_entity_spans_post_processing, and generated the entity triples with get_entity_spans_finalize. The get_prefix_allowed_tokens_fn used in model.sample takes None for both mention_trie and mention_to_candidates_dict. (A sketch of this pipeline follows below.)
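For reference, a minimal sketch of that pipeline as I understand it (function names come from genre.utils and genre.entity_linking; the exact unpacking of model.sample outputs is an assumption, and genre.utils.get_entity_spans_fairseq wraps the same steps):

from genre.fairseq_model import GENRE
from genre.entity_linking import get_end_to_end_prefix_allowed_tokens_fn_fairseq
from genre.utils import (
    get_entity_spans_pre_processing,
    get_entity_spans_post_processing,
    get_entity_spans_finalize,
)

model = GENRE.from_pretrained("models/fairseq_e2e_entity_linking_aidayago").eval()
sentences = ["In 1921, Einstein received a Nobel Prize."]

# pre-process, generate with constrained decoding, post-process, extract spans
prep = get_entity_spans_pre_processing(sentences)
prefix_fn = get_end_to_end_prefix_allowed_tokens_fn_fairseq(
    model, prep, mention_trie=None, mention_to_candidates_dict=None
)
raw = model.sample(prep, prefix_allowed_tokens_fn=prefix_fn)
post = get_entity_spans_post_processing([beams[0]["text"] for beams in raw])
spans = get_entity_spans_finalize(sentences, post)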

I evaluated the results on KORE50 and Derczynski.
The result for KORE50 is:
micro F1: 42.11
micro precision: 37.78
micro recall: 47.55
macro F1: 39.83
macro precision: 36.87
macro recall: 45.03

The result for Derczynski is:
micro F1: 47.49
micro precision: 39.49
micro recall: 59.56
macro F1: 44.18
macro precision: 44.26
macro recall: 47.3

Maybe there are some problems with my GENRE settings, the functions I used to generate the predictions, etc. I would really appreciate it if you could point out any mistakes and share more details.

Potential bugs in evaluate_kilt_dataset.py

Hi,

Thanks for your work.
I want to run evaluate_kilt_dataset.py for evaluation purposes.

I found two problems:

  1. iter_ = tqdm(batch_it(dataset, len(dataset) // batch_size), desc="Evaluating")

Should it be the following?
iter_ = tqdm(batch_it(dataset, batch_size), desc="Evaluating")

  2. Model behavior differs a lot when changing batch_size

Specifically, with batch_size = 1, --candidates, and otherwise default parameters, the performance on ace2004-test-kilt is:
f1=0.897, prec=0.929, rec=0.868
With batch_size = 64, --candidates, and otherwise default parameters, the performance on ace2004-test-kilt is:
f1=0.0524, prec=0.0544, rec=0.0506
I also tried other batch sizes; the performance is different each time, so I suspect there are bugs in this script (see the illustration below).
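On the first point, a toy illustration of the suspected bug (assuming batch_it(seq, n) yields successive chunks of n items; that semantics is an assumption about genre.utils.batch_it):

def batch_it(seq, n):
    # assumed semantics: yield successive chunks of n items
    for i in range(0, len(seq), n):
        yield seq[i : i + n]

dataset = list(range(1000))
batch_size = 64
buggy = list(batch_it(dataset, len(dataset) // batch_size))  # chunk size 15, tied to dataset length
fixed = list(batch_it(dataset, batch_size))                  # chunk size 64
print(len(buggy[0]), len(fixed[0]))  # 15 64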

Could you please take a look, and could you provide the command used to generate the entity disambiguation numbers reported in the ICLR 2021 paper?
Thanks a lot.

get_entity_spans and model.sample output in entity linking

How can I get the same output from model.sample as from get_entity_spans when linking entities in a given text?
For example, in this case (using the HF model) I get different entity outputs:

import os

from genre.hf_model import GENRE
from genre.utils import get_entity_spans_hf as get_entity_spans
from genre.entity_linking import get_end_to_end_prefix_allowed_tokens_fn_hf as get_prefix_allowed_tokens_fn

model = GENRE.from_pretrained(os.path.join(cache_dir, "hf_e2e_entity_linking_wiki_abs")).eval()
sentences = ["Tired of the lies? Tired of the spin? Are you ready to hear the hard-hitting truth in comprehensive, conservative, principled fashion? The Ben Shapiro Show brings you all the news you need to know in the most fast moving daily program in America."]
entity_spans = get_entity_spans(
    model,
    sentences)
print(entity_spans)

prefix_allowed_tokens_fn = get_prefix_allowed_tokens_fn(model, sentences)
out = model.sample(
    sentences,
    prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
)
print(out)
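One plausible cause: get_entity_spans pre-processes the input and post-processes the generated markup internally, while the raw model.sample call above skips both steps. A hedged sketch of aligning the two, reusing model, sentences, and get_prefix_allowed_tokens_fn from the snippet (the output unpacking is an assumption):

from genre.utils import (
    get_entity_spans_pre_processing,
    get_entity_spans_post_processing,
    get_entity_spans_finalize,
)

# feed model.sample the same pre-processed input that get_entity_spans uses,
# then post-process its markup the same way before extracting spans
prep = get_entity_spans_pre_processing(sentences)
out = model.sample(prep, prefix_allowed_tokens_fn=get_prefix_allowed_tokens_fn(model, prep))
post = get_entity_spans_post_processing([beams[0]["text"] for beams in out])
print(get_entity_spans_finalize(sentences, post))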

[Entity Disambiguation] some training samples in blink-train-kilt have similar samples in wned-wiki

Hi,

We found that some samples in blink-train-kilt.jsonl are very similar to samples in the wned-wiki test set. Did you remove this part of the data during training? From the experimental results, GENRE does not improve much on the other datasets (ACE2004, CWEB, AIDA-b, AQUAINT) but shows a significant improvement on wned-wiki; could this be related to these possibly leaked samples?

Here are some examples (since Wikipedia has been updated relative to the wned-wiki dataset, I cannot filter out this type of sample by exact matching; a rough near-duplicate check is sketched after the examples):

  1. Wikipedia page: https://en.wikipedia.org/wiki/Big_Blue_River_(Indiana)
    blink-train-kilt:{"id": "blink-train-731858", "input": "The Big Blue River is an [START_ENT] tributary [END_ENT] of the Driftwood River in east-central Indiana in the United States. Via the Driftwood, White, Wabash and Ohio rivers, it is part of the watershed of the Mississippi River.", "output": [{"answer": "Tributary", "provenance": [{"wikipedia_id": "72465", "title": "Tributary"}]}], "meta": {"mention": "tributary", ...}}
    wiki-test-kilt: {"id": 22, "input": "The Big Blue River is an [START_ENT] tributary [END_ENT] of the Driftwood River in east central Indiana in the United States Via the Driftwood White Wabash and Ohio rivers it is part of the watershed of the Mississippi River The Big Blue rises in northeastern Henry County and flows generally southwestwardly through Rush Hancock Shelby and Johnson counties past the towns of New Castle Knightstown Carthage Morristown Shelbyville and Edinburgh It joins Sugar Creek to form the Driftwood River west of Edinburgh At Shelbyville it collects the", "output": [{"answer": "Tributary", "provenance": [{"title": "Tributary"}]}], "meta": {..., "mention": "tributary"}, "candidates": [...]}

and under the same page, there are also pairs: (blink-train-6446404, wiki-test-kilt-id-30), (blink-train-6621890, wiki-test-kilt-id-27)

  2. Wikipedia page: https://en.wikipedia.org/wiki/Energy_in_Sudan
    blink-train-kilt: {"id": "blink-train-2613656", "input": "Energy in Sudan describes energy and [START_ENT] electricity [END_ENT] production, consumption and imports in Sudan. Sudan is a net energy exporter. Primary energy use in Sudan was 179 kWh and 4 kWh per million persons in 2008.", "output": [{"answer": "Electricity generation", "provenance": [{"wikipedia_id": "9540", "title": "Electricity generation"}]}], "meta": {"mention": "electricity", ...}
    wiki-test-kilt: {"id": 357, "input": "Energy in Sudan describes and [START_ENT] electricity [END_ENT] production consumption and imports in Sudan Sudan is a net energy exporter Primary energy use in Sudan was 179 kWh and 4 kWh per million persons in 2008 The world share of energy production in Africa was 12 percent of oil and 7 percent of gas in 2009 In 2010 major energy producers in Africa were Algeria Angola Cameroon Democratic Republic of the Congo Equatorial Guinea Gabon Libya Nigeria and Sudan According to the OECD and the World Bank the population growth of from 2004 to 2008 was 16 4 percent in comparison to the world average of 5 3", "output": [{"answer": "Electricity generation", "provenance": [{"title": "Electricity generation"}]}], "meta": {..., "mention": "electricity"}, "candidates": ...}

  3. Wikipedia page: https://en.wikipedia.org/wiki/2009_European_Pairs_Speedway_Championship
    blink-train-kilt: {"id": "blink-train-1894399", "input": "The 2009 European Pairs Speedway Championship will be the 6th UEM European Pairs Speedway Championship season. The Final was held on 26 September 2009 in Miskolc, Hungary; it was second Final in Hungary, but first in Miskolc. The championship was won by [START_ENT] Czech Republic pair [END_ENT] and they beat Russia and the defending Champions Poland.", "output": [{"answer": "Czech Republic national speedway team", "provenance": [{"wikipedia_id": "13444681", "title": "Czech Republic national speedway team"}]}], "meta": {"mention": "Czech Republic pair", ...}}
    wiki-test-kilt: {"id": 174, "input": "The 2009 European Pairs Speedway Championship will be the 6th UEM European Pairs Speedway Championship season The Final was held on 26 September 2009 in Miskolc Hungary it was second Final in Hungary but first in Miskolc The championship was won by [START_ENT] Czech Republic pair [END_ENT] and they beat Russia and the defending Champions Poland In the Final will be the defending Champion Poland Czech Republic 2nd place in 2008 Final Russia 3rd place host team Hungary 4th place and Latvia 5th place A last finalist will be determined in one Semi Final In Ljubljana Slovenia on May 13 will be Austria 6th place Germany 7th place Ukraine Finland host team Slovenia Italy and Croatia", "output": [{"answer": "Czech Republic national speedway team", "provenance": [{"title": "Czech Republic national speedway team"}]}], "meta": {..., "mention": "Czech Republic pair"}, "candidates": [...]}

Wikiextractor version and Wikidata

Hi, I have some issues running download_wiki.sh. I noticed that the latest wikiextractor doesn't have the "--list --sections" options. Could you share which version you used?

Besides, could you also explain how to obtain the wikidata-all.json data?

Thank you for the response!

Finetuning mGENRE using fairseq-train - please ensure architectures match

I processed my Icelandic dataset into KILT format and then into fairseq binary format using the code in this repo (with SentencePiece and mGENRE's dictionary), with the goal of fine-tuning mGENRE.

The model I'm using is model.pt from fairseq_multilingual_entity_disambiguation.
I used scripts_genre/train.sh and when I ran it I got the following error:

RuntimeError: Error(s) in loading state_dict for BARTModel:
Unexpected key(s) in state_dict: "encoder.layer_norm.weight", "encoder.layer_norm.bias", "decoder.layer_norm.weight", "decoder.layer_norm.bias".

Exception: Cannot load model parameters from checkpoint mgenre/model.pt; please ensure that the architectures match.

I'm able to train GENRE using the same train.sh script (for data I processed using BPE) without a problem. But not mGENRE.

I tried changing the --arch parameter to mbart_large, but still get the same error.

Any idea which parameters I need to change to make fairseq-train work?

Question about entity linking without pre-specified span markers

Hey,

Similar to the GENRE document retrieval example, is it possible to get entity linking from sentences without pre-specifying the [START] and [END] tokens for multilingual text?

Fairseq build fails when installing from https://github.com/nicola-decao/fairseq.git@fixing_prefix_allowed_tokens_fn

I'm reliably unable to install fairseq as specified. I'm likely doing something wrong here, but here's the trace:

machine@host:/usr/blah/genre$ conda activate
(base) machine@host:/usr/blah/genre$ pip install git+https://github.com/nicola-decao/fairseq.git@fixing_prefix_allowed_tokens_fn
Collecting git+https://github.com/nicola-decao/fairseq.git@fixing_prefix_allowed_tokens_fn
  Cloning https://github.com/nicola-decao/fairseq.git (to revision fixing_prefix_allowed_tokens_fn) to /tmp/pip-req-build-y8u09mgl
  Running command git clone -q https://github.com/nicola-decao/fairseq.git /tmp/pip-req-build-y8u09mgl
  Running command git checkout -b fixing_prefix_allowed_tokens_fn --track origin/fixing_prefix_allowed_tokens_fn
  Switched to a new branch 'fixing_prefix_allowed_tokens_fn'
  Branch 'fixing_prefix_allowed_tokens_fn' set up to track remote branch 'fixing_prefix_allowed_tokens_fn' from 'origin'.
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Requirement already satisfied (use --upgrade to upgrade): fairseq==1.0.0a0+4c4d5a7 from git+https://github.com/nicola-decao/fairseq.git@fixing_prefix_allowed_tokens_fn in /home/cryptochat/anaconda3/lib/python3.8/site-packages
Requirement already satisfied: hydra-core<1.1 in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from fairseq==1.0.0a0+4c4d5a7) (1.0.6)
Requirement already satisfied: sacrebleu>=1.4.12 in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from fairseq==1.0.0a0+4c4d5a7) (1.5.1)
Requirement already satisfied: tqdm in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from fairseq==1.0.0a0+4c4d5a7) (4.47.0)
Requirement already satisfied: numpy; python_version >= "3.7" in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from fairseq==1.0.0a0+4c4d5a7) (1.18.5)
Requirement already satisfied: cython in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from fairseq==1.0.0a0+4c4d5a7) (0.29.21)
Requirement already satisfied: omegaconf<2.1 in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from fairseq==1.0.0a0+4c4d5a7) (2.0.6)
Requirement already satisfied: regex in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from fairseq==1.0.0a0+4c4d5a7) (2020.6.8)
Requirement already satisfied: torch in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from fairseq==1.0.0a0+4c4d5a7) (1.8.1)
Requirement already satisfied: cffi in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from fairseq==1.0.0a0+4c4d5a7) (1.14.0)
Requirement already satisfied: importlib-resources; python_version < "3.9" in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from hydra-core<1.1->fairseq==1.0.0a0+4c4d5a7) (5.1.4)
Requirement already satisfied: antlr4-python3-runtime==4.8 in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from hydra-core<1.1->fairseq==1.0.0a0+4c4d5a7) (4.8)
Requirement already satisfied: portalocker==2.0.0 in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from sacrebleu>=1.4.12->fairseq==1.0.0a0+4c4d5a7) (2.0.0)
Requirement already satisfied: PyYAML>=5.1.* in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from omegaconf<2.1->fairseq==1.0.0a0+4c4d5a7) (5.3.1)
Requirement already satisfied: typing-extensions in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from omegaconf<2.1->fairseq==1.0.0a0+4c4d5a7) (3.7.4.2)
Requirement already satisfied: pycparser in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from cffi->fairseq==1.0.0a0+4c4d5a7) (2.20)
Requirement already satisfied: zipp>=3.1.0; python_version < "3.10" in /home/cryptochat/anaconda3/lib/python3.8/site-packages (from importlib-resources; python_version < "3.9"->hydra-core<1.1->fairseq==1.0.0a0+4c4d5a7) (3.1.0)
Building wheels for collected packages: fairseq
  Building wheel for fairseq (PEP 517) ... done
  Created wheel for fairseq: filename=fairseq-1.0.0a0+4c4d5a7-cp38-cp38-linux_x86_64.whl size=2106379 sha256=e2eb7fe8efbf007312cb87eaf8f538cbdc88a00315aced82ac0532c7f4a5d722
  Stored in directory: /tmp/pip-ephem-wheel-cache-ysl4vpiz/wheels/d8/2a/fe/c8943070346a6761277e3915908a7c35cf666117b6859f058a
Successfully built fairseq
(base) machine@host:/usr/blah/genre$ ls
(base) machine@host:/usr/blah/genre$ python3
Python 3.8.3 (default, Jul  2 2020, 16:21:59) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fairseq
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cryptochat/anaconda3/lib/python3.8/site-packages/fairseq/__init__.py", line 32, in <module>
    import fairseq.criterions  # noqa
  File "/home/cryptochat/anaconda3/lib/python3.8/site-packages/fairseq/criterions/__init__.py", line 36, in <module>
    importlib.import_module("fairseq.criterions." + file_name)
  File "/home/cryptochat/anaconda3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/cryptochat/anaconda3/lib/python3.8/site-packages/fairseq/criterions/label_smoothed_cross_entropy_latency_augmented.py", line 6, in <module>
    from examples.simultaneous_translation.utils.latency import LatencyTraining
ModuleNotFoundError: No module named 'examples'
>>> 

mGENRE crashes on certain Chinese input

Hello, I am having some issues when testing the mGENRE model on Chinese inputs.

When I run the following code (environment: Python 3.7.10, torch 1.8.1 + CUDA 11.1):

import pickle
from genre.trie import Trie, MarisaTrie
from genre.fairseq_model import mGENRE

with open("./data/lang_title2wikidataID-normalized_with_redirect.pkl", "rb") as f:
    lang_title2wikidataID = pickle.load(f)

# memory efficient prefix tree (trie) implemented with `marisa_trie`
with open("./data/titles_lang_all105_marisa_trie_with_redirect.pkl", "rb") as f:
    trie = pickle.load(f)

# generate Wikipedia titles and language IDs
model = mGENRE.from_pretrained("./models/fairseq_multilingual_entity_disambiguation").eval()

res = model.sample(
    sentences=["物质与其所对应的反物质碰撞后消失称为 [START] 湮灭 [END] 。"],
    # Chinese for "The disappearance of matter after it collides with its corresponding antimatter is called [START] annihilation [END]."
    prefix_allowed_tokens_fn=lambda batch_id, sent: [
        e for e in trie.get(sent.tolist()) if e < len(model.task.target_dictionary)
    ],
    text_to_id=lambda x: max(lang_title2wikidataID[
        tuple(reversed(x.split(" >> ")))
    ], key=lambda y: int(y[1:])),
    marginalize=True,
)

print(res)

The program will trigger a CUDA device-side assert like:

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [161,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

The corresponding traceback is:

  File "test.py", line 24, in <module>
    marginalize=True,
  File "~/GENRE/genre/fairseq_model.py", line 46, in sample
    **kwargs,
  File "~/GENRE/genre/fairseq_model.py", line 96, in generate
    return super(BARTHubInterface, self).generate(*args, **kwargs)
  File "~/fairseq/fairseq/hub_utils.py", line 179, in generate
    generator, self.models, batch, **inference_step_args
  File "~/fairseq/fairseq/tasks/fairseq_task.py", line 502, in inference_step
    models, sample, prefix_tokens=prefix_tokens, constraints=constraints
  File "~/anaconda3/envs/genre/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_co
ntext
    return func(*args, **kwargs)
  File "~/fairseq/fairseq/sequence_generator.py", line 886, in generate
    finalized = super()._generate(sample, **kwargs)
  File "~/fairseq/fairseq/sequence_generator.py", line 242, in _generate
    encoder_outs = self.model.forward_encoder(net_input)
  File "~/fairseq/fairseq/sequence_generator.py", line 757, in forward_encoder
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "~/fairseq/fairseq/sequence_generator.py", line 757, in <listcomp>
    return [model.encoder.forward_torchscript(net_input) for model in self.models]
  File "~/fairseq/fairseq/models/fairseq_encoder.py", line 55, in forward_torchscript
    return self.forward_non_torchscript(net_input)
  File "~/fairseq/fairseq/models/fairseq_encoder.py", line 62, in forward_non_torchscript
    return self.forward(**encoder_input)
  File "~/fairseq/fairseq/models/transformer.py", line 440, in forward
    token_embeddings)
  File "~/fairseq/fairseq/models/transformer.py", line 497, in forward_scriptable
    x, encoder_padding_mask=encoder_padding_mask if has_pads else None

However, if I change the word 湮灭 ("annihilation") to other words, or test the program with English text, the exception does not occur.
Could the exception come from a mismatch between the vocabulary size and the embedding matrix shape? I would appreciate any advice.
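A hedged debugging sketch for this kind of crash (reusing model from the snippet above; the dictionary check is an assumption about where the bad index comes from). Device-side asserts hide the offending index, so comparing the encoded token IDs against the dictionary size, or re-running on CPU, usually surfaces a readable error:

# check whether any encoded token ID falls outside the model's dictionary
tokens = model.encode("物质与其所对应的反物质碰撞后消失称为 [START] 湮灭 [END] 。")
print(tokens.max().item(), len(model.task.source_dictionary))

# running the same sample() call on CPU (model.cpu()) typically raises a
# readable IndexError instead of the opaque device-side assert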
