Comments (6)
No, the disambiguation model does not need such a mapping, because you are already specifying the mention.
You should do something like this:
sentences = [
    "[START_ENT] Leonardo [END_ENT] was a painter while Leonardo Di Caprio is an actor",
    "Leonardo was a painter while [START_ENT] Leonardo Di Caprio [END_ENT] is an actor",
]

# generate a set of tries, one for each sentence, with the correct candidates;
# in this case the candidate set is the same for both sentences, but in general
# it is different for each sentence in your batch
tries = {
    _id: Trie([
        [2] + model.encode(e)[1:].tolist()
        for e in candidates
    ])
    for _id, candidates in enumerate([
        ["Leonardo Di Caprio", "Leonardo Da Vinci"],
        ["Leonardo Di Caprio", "Leonardo Da Vinci"],
    ])
}

out = model.sample(
    sentences,
    prefix_allowed_tokens_fn=lambda batch_id, sent: tries[batch_id].get(sent.tolist()),
)
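For readers following along: as I understand it, genre.trie.Trie acts as a prefix trie over token-id sequences, and prefix_allowed_tokens_fn asks it, at each decoding step, which token ids may follow the tokens generated so far (the leading 2 is, to my understanding, a fixed special-token id of the underlying BART vocabulary). A minimal sketch of that idea, where SimpleTrie is illustrative rather than the library class and the token ids 10/11/12 are made up:

```python
from typing import Dict, List

class SimpleTrie:
    """Minimal prefix trie mirroring the behaviour assumed of genre.trie.Trie."""

    def __init__(self, sequences: List[List[int]]):
        self.root: Dict[int, dict] = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def get(self, prefix: List[int]) -> List[int]:
        """Return the token ids allowed after the given prefix."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []  # prefix matches no candidate: nothing is allowed
            node = node[tok]
        return list(node.keys())

# two made-up candidates sharing the prefix [2, 10]
trie = SimpleTrie([[2, 10, 11], [2, 10, 12]])
trie.get([2, 10])  # after [2, 10], both 11 and 12 are still allowed
```

Constrained beam search then only ever expands hypotheses whose next token is in the returned set, which is why generation can never leave the candidate list.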
from genre.
Thanks @nicola-decao, can you please tell me how to find more entities in a sentence? I've seen the example you posted and I was trying this one:
sentences = ["[START_ENT]Leonardo[END_ENT] was a painter while [START_ENT]Leonardo Di Caprio[END_ENT] is an actor"]

out = model.sample(
    sentences,
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
But the result is:
[[{'text': 'Leonardo DiCaprio', 'logprob': tensor(-0.8849)}],
[{'text': 'Leonardo Di Cosmo', 'logprob': tensor(-2.1244)}],
[{'text': 'Leonardo DiCaprio filmography', 'logprob': tensor(-2.2759)}],
[{'text': 'Leonardo Di Cesare', 'logprob': tensor(-2.5289)}],
[{'text': 'Léonardo Meindl', 'logprob': tensor(-4.0481)}]]
Maybe I'm making some mistake in the declaration of the entities. And another question: is it possible to add candidates, as for the other models, in order to ease the disambiguation? (In this case, for example, I would have passed Leonardo Di Caprio and Leonardo Da Vinci as possibilities.)
Thank you!
The disambiguation model only handles one disambiguation input at a time. So you either pass "[START_ENT] Leonardo [END_ENT] was a ..." or "Leonardo was a painter while [START_ENT] Leonardo Di Caprio [END_ENT] is..".
In addition, you need to put a space between the delimiters and the mention (see my sentences above).
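To make the spacing requirement concrete, here is a tiny hypothetical helper (mark_mention is my own name, not part of GENRE) that wraps a mention with properly spaced delimiters given its character offsets:

```python
def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap the mention at text[start:end] with spaced [START_ENT]/[END_ENT] delimiters."""
    return f"{text[:start]}[START_ENT] {text[start:end]} [END_ENT]{text[end:]}"

s = "Leonardo was a painter"
mark_mention(s, 0, 8)  # '[START_ENT] Leonardo [END_ENT] was a painter'
```

Building the input programmatically like this avoids accidentally gluing the delimiters to the mention.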
Ok, thank you @nicola-decao, I think I'm almost there. Just some final questions for my practical use:
- Is the [2] you put in the Trie dependent on the number of sentences I write, or is it a fixed encoding to keep?
- I'm printing the results and I have the following:
[[{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.0639)},
{'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.4777)}],
[{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)},
{'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.6667e+08)}],
[{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)},
{'text': 'Leonardo DiCaprio', 'logprob': tensor(-0.8483)}],
[{'text': 'Leonardo Da Vinci', 'logprob': tensor(-2.2151)},
{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)}],
[{'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.6667e+08)},
{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)}]]
I'm seeing that the sample function always returns 5 results (even if I put other sentences), so I think that if I put only two possibilities (like Leonardo DiCaprio and Leonardo Da Vinci), they will be "repeated" in the 5 results, am I right?
I also see that, in this case, these 5 results are lists of 2 elements each (since I put two sentences), while if I put only 1 sentence they are lists of 1 element (which seems right). How do I have to read these values? I thought that the two sentences were "independent" (or not?).
Indeed the first element of the first result, {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.0639)}, is higher than the first element of the third result, {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)}, BUT the second element of the first result, {'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.4777)}, is lower than the second element of the third result, {'text': 'Leonardo DiCaprio', 'logprob': tensor(-0.8483)}.
So I tried to read the first and the second element of the 5 results separately (as if they were not correlated; I don't know if that is the correct thought):

import torch

number_of_labels = len(out[0])
labels = {}
for choice in out:
    for i, label in enumerate(choice):
        if i not in labels or torch.gt(label['logprob'], labels[i]['logprob']):
            labels[i] = {"text": label['text'], "logprob": label['logprob']}
labels
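The aggregation loop above can also be written more compactly with max and a key function; a runnable sketch using dummy results (plain floats in place of torch tensors, and the names out/best are illustrative):

```python
# Dummy stand-in for the model output: two result lists,
# each with one hypothesis per position.
out = [
    [{'text': 'Leonardo Da Vinci', 'logprob': -1.0639},
     {'text': 'Leonardo DiCaprio', 'logprob': -1.4777}],
    [{'text': 'Leonardo Da Vinci', 'logprob': -1.6667e+08},
     {'text': 'Leonardo DiCaprio', 'logprob': -0.8483}],
]

# For each position i, keep the hypothesis with the highest logprob
# across all result lists.
best = {
    i: max((choice[i] for choice in out), key=lambda h: h['logprob'])
    for i in range(len(out[0]))
}
# best[0]['text'] == 'Leonardo Da Vinci', best[1]['text'] == 'Leonardo DiCaprio'
```

Whether reading positions independently is actually the right interpretation of the output is exactly the open question in this thread, so treat this only as a cleaner form of the same heuristic.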
obtaining:
{0: {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.0639)},
1: {'text': 'Leonardo DiCaprio', 'logprob': tensor(-0.8483)}}
and it seemed good! Because in my first sentence I have Leonardo Da Vinci, while in the second one Leonardo DiCaprio.
But adding another row, like this:

sentences = [
    "Leonardo was a painter while [START_ENT] Leonardo Di Caprio [END_ENT] is an actor",
    "[START_ENT] Leonardo [END_ENT] was a painter while Leonardo Di Caprio is an actor",
    "[START_ENT] Brown [END_ENT] was an American singer, songwriter, dancer, musician, record producer, and bandleader",
]

# generate a set of tries, one for each sentence, with the correct candidates;
# here the first two sentences share the same candidate set while the third has its own
tries = {
    _id: Trie([
        [2] + dmodel.encode(e)[1:].tolist()
        for e in candidates
    ])
    for _id, candidates in enumerate([
        ["Leonardo DiCaprio", "Leonardo Da Vinci"],
        ["Leonardo DiCaprio", "Leonardo Da Vinci"],
        ["Kwame Brown", "James Brown"],
    ])
}

out = dmodel.sample(
    sentences,
    prefix_allowed_tokens_fn=lambda batch_id, sent: tries[batch_id].get(sent.tolist()),
)
I have a mixed result:
[[{'text': 'Leonardo DiCaprio', 'logprob': tensor(-0.8483)},
{'text': 'Leonardo Da Vinci', 'logprob': tensor(-2.2151)},
{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)}],
[{'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.6667e+08)},
{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)},
{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.0638)}],
[{'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.4777)},
{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)},
{'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.6667e+08)}],
[{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)},
{'text': 'James Brown', 'logprob': tensor(-0.0789)},
{'text': 'Kwame Brown', 'logprob': tensor(-4.0590)}],
[{'text': 'Kwame Brown', 'logprob': tensor(-2.0000e+08)},
{'text': 'James Brown', 'logprob': tensor(-3.3333e+08)},
{'text': 'James Brown', 'logprob': tensor(-3.3333e+08)}]]
In the third element of the first result I still have Leonardo..., while I expected to find only James Brown or Kwame Brown; indeed my script gives a wrong result:
{0: {'text': 'Leonardo DiCaprio', 'logprob': tensor(-0.8483)},
1: {'text': 'James Brown', 'logprob': tensor(-1.7946)},
2: {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.0638)}}
Maybe I have to use one line at a time in order to avoid confusion?
Thank you in advance for the explanation, it's starting to work 😊
The best way to do disambiguation would be to use the model trained for disambiguation 😅
Look at this to see how it works: https://github.com/facebookresearch/GENRE/tree/main/examples_genre#example-entity-disambiguation
Ah ok, so I will pass multiple sentences!
So I deduce that hints like:

mention_to_candidates_dict = {
    "Leonardo": ["Leonardo Di Caprio", "Leonardo Da Vinci"],
    "Leonardo Di Caprio": ["Leonardo Di Caprio", "Leonardo Da Vinci"],
}

are not available for the disambiguation model, are they?
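For what it's worth, such a dict can still be put to use with the disambiguation model by looking up each sentence's marked mention and deriving the per-sentence candidate list from it (which can then be turned into a per-sentence Trie as in the snippets above). A small sketch, where candidates_for is a hypothetical helper and not a GENRE function:

```python
import re

mention_to_candidates_dict = {
    "Leonardo": ["Leonardo Di Caprio", "Leonardo Da Vinci"],
    "Leonardo Di Caprio": ["Leonardo Di Caprio", "Leonardo Da Vinci"],
}

def candidates_for(sentence: str) -> list:
    """Extract the [START_ENT] ... [END_ENT] mention and look up its candidates."""
    m = re.search(r"\[START_ENT\] (.+?) \[END_ENT\]", sentence)
    return mention_to_candidates_dict.get(m.group(1), []) if m else []

candidates_for("[START_ENT] Leonardo [END_ENT] was a painter")
# → ['Leonardo Di Caprio', 'Leonardo Da Vinci']
```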