
Comments (6)

nicola-decao commented on May 24, 2024

No, the disambiguation model does not need such mapping because you are already specifying the mention.

You should do something like this:

sentences = [
    "[START_ENT] Leonardo [END_ENT] was a painter while Leonardo Di Caprio is an actor",
    "Leonardo was a painter while [START_ENT] Leonardo Di Caprio [END_ENT] is an actor"
]

# generate a set of tries, one for each sentence, with the correct candidates;
# in this case it is the same for both sentences, but in general it differs
# for each sentence in your batch
tries = {
    _id: Trie([
        [2] + model.encode(e)[1:].tolist()
        for e in candidates
    ])
    for _id, candidates in enumerate([
        ["Leonardo Di Caprio", "Leonardo Da Vinci"],
        ["Leonardo Di Caprio", "Leonardo Da Vinci"],
    ])
}

out = model.sample(
    sentences,
    prefix_allowed_tokens_fn=lambda batch_id, sent: tries[batch_id].get(sent.tolist()),
)
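For reference, the `Trie` in the snippet above acts as a prefix index over the candidate token-id sequences: given the ids decoded so far, it returns the ids allowed next, which is exactly what `prefix_allowed_tokens_fn` needs. A minimal sketch of the idea (a hypothetical stand-in, not the actual `genre.trie.Trie` implementation):

```python
# Minimal prefix trie sketch (hypothetical stand-in for genre.trie.Trie):
# stores candidate token-id sequences and, for a given prefix of generated
# ids, returns the token ids allowed next.
class Trie:
    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def get(self, prefix):
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []          # prefix matches no candidate
            node = node[tok]
        return list(node.keys())   # allowed next token ids

# toy token-id sequences for two candidates sharing a prefix
trie = Trie([[2, 10, 11], [2, 10, 12]])
print(trie.get([2, 10]))  # -> [11, 12]
```

Constrained beam search then only ever expands prefixes of the candidate strings, so the model can only ever emit one of the candidates.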

from genre.

paulthemagno commented on May 24, 2024

Thanks @nicola-decao, can you please tell me how to find multiple entities in a sentence? I've seen the example you posted and I was trying this one:

sentences = ["[START_ENT]Leonardo[END_ENT] was a painter while [START_ENT]Leonardo Di Caprio[END_ENT] is an actor"]

out = model.sample(
    sentences,
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)

But the result is:

[[{'text': 'Leonardo DiCaprio', 'logprob': tensor(-0.8849)}],
 [{'text': 'Leonardo Di Cosmo', 'logprob': tensor(-2.1244)}],
 [{'text': 'Leonardo DiCaprio filmography', 'logprob': tensor(-2.2759)}],
 [{'text': 'Leonardo Di Cesare', 'logprob': tensor(-2.5289)}],
 [{'text': 'Léonardo Meindl', 'logprob': tensor(-4.0481)}]]

Maybe I'm making some mistakes in the declaration of the entities. And another question: is it possible to add candidates, as for the other models, in order to ease the disambiguation? (In this case, for example, I should have passed Leonardo Di Caprio and Leonardo Da Vinci as possibilities.)

Thank you!


nicola-decao commented on May 24, 2024

The disambiguation model only handles one disambiguation input at a time. So you either pass "[START_ENT] Leonardo [END_ENT] was a ..." or "Leonardo was a painter while [START_ENT] Leonardo Di Caprio [END_ENT] is..".

In addition, you need to put a space between the delimiters and the mention (see my sentences above).
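Putting both points together, a sentence with several mentions has to be split into one input per mention, with a space between the delimiters and the mention. A small sketch (the helper is hypothetical, not part of GENRE):

```python
# Hypothetical helper: build one disambiguation input per mention,
# wrapping each mention with space-separated delimiters.
def one_input_per_mention(text, mentions):
    return [
        text.replace(m, "[START_ENT] " + m + " [END_ENT]", 1)
        for m in mentions
    ]

sentence = "Leonardo was a painter while Leonardo Di Caprio is an actor"
for s in one_input_per_mention(sentence, ["Leonardo", "Leonardo Di Caprio"]):
    print(s)
```

Each resulting sentence can then be passed to `model.sample` with its own candidate trie, as in the earlier snippet.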


paulthemagno commented on May 24, 2024

Ok, thank you @nicola-decao, I think I'm almost there. Just some final questions for my practical use:

  • Does the [2] you put in the Trie depend on the number of sentences I write, or is it a fixed encoding to keep?
  • I'm printing the results and I have the following:
[[{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.0639)},
  {'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.4777)}],
 [{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)},
  {'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.6667e+08)}],
 [{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)},
  {'text': 'Leonardo DiCaprio', 'logprob': tensor(-0.8483)}],
 [{'text': 'Leonardo Da Vinci', 'logprob': tensor(-2.2151)},
  {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)}],
 [{'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.6667e+08)},
  {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)}]]

I'm seeing that every time the sample function returns 5 results (even if I put other sentences), so I think that if I put only two possibilities (like Leonardo DiCaprio and Leonardo Da Vinci), they will be "repeated" in the 5 results, am I right?

I also see that, in this case, these 5 possibilities are lists of 2 elements each (since I put two sentences), while if I put only 1 sentence, they are lists of 1 element (which seems right). How do I have to read these values? I thought the two sentences were "independent" (or not?).

Indeed the first element of the first result, {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.0639)}, is higher than the first element of the third result, {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)}, BUT the second element of the first result, {'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.4777)}, is lower than the second element of the third result, {'text': 'Leonardo DiCaprio', 'logprob': tensor(-0.8483)}.

So I tried to read the first and the second elements of the 5 results separately (as if they were not correlated; I don't know if that's the correct thought):

import torch

number_of_labels = len(out[0])
labels = {}

for choice in out:
    for i, label in enumerate(choice):
        if i not in labels:
            labels[i] = {"text": label['text'], 'logprob': label['logprob']}
        else:
            if torch.gt(label['logprob'], labels[i]['logprob']):
                labels[i] = {"text": label['text'], 'logprob': label['logprob']}

labels

obtaining:

{0: {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.0639)},
 1: {'text': 'Leonardo DiCaprio', 'logprob': tensor(-0.8483)}}

and it seemed good, because in my first sentence I have Leonardo Da Vinci, while in the second one Leonardo DiCaprio.

But adding another row, like this:

sentences = [
    "Leonardo was a painter while [START_ENT] Leonardo Di Caprio [END_ENT] is an actor",
    "[START_ENT] Leonardo [END_ENT] was a painter while Leonardo Di Caprio is an actor",
    "[START_ENT] Brown [END_ENT] was an American singer, songwriter, dancer, musician, record producer, and bandleader"
]

# generate a set of tries, one for each sentence, with the correct candidates;
# in this case it is the same for the first two sentences, but in general it
# differs for each sentence in your batch
tries = {
    _id: Trie([
        [2] + dmodel.encode(e)[1:].tolist()
        for e in candidates
    ])
    for _id, candidates in enumerate([
        ["Leonardo DiCaprio", "Leonardo Da Vinci"],
        ["Leonardo DiCaprio", "Leonardo Da Vinci"],
        ["Kwame Brown", "James Brown"],
    ])
}

out = dmodel.sample(
    sentences,
    prefix_allowed_tokens_fn=lambda batch_id, sent: tries[batch_id].get(sent.tolist()),
)

I have a mixed result:

[[{'text': 'Leonardo DiCaprio', 'logprob': tensor(-0.8483)},
  {'text': 'Leonardo Da Vinci', 'logprob': tensor(-2.2151)},
  {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)}],
 [{'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.6667e+08)},
  {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)},
  {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.0638)}],
 [{'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.4777)},
  {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)},
  {'text': 'Leonardo DiCaprio', 'logprob': tensor(-1.6667e+08)}],
 [{'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.6667e+08)},
  {'text': 'James Brown', 'logprob': tensor(-0.0789)},
  {'text': 'Kwame Brown', 'logprob': tensor(-4.0590)}],
 [{'text': 'Kwame Brown', 'logprob': tensor(-2.0000e+08)},
  {'text': 'James Brown', 'logprob': tensor(-3.3333e+08)},
  {'text': 'James Brown', 'logprob': tensor(-3.3333e+08)}]]

In the third element of the first result I still have Leonardo..., while I expected to find only James Brown or Kwame Brown; indeed my script gives a wrong result:

{0: {'text': 'Leonardo DiCaprio', 'logprob': tensor(-0.8483)},
 1: {'text': 'James Brown', 'logprob': tensor(-1.7946)},
 2: {'text': 'Leonardo Da Vinci', 'logprob': tensor(-1.0638)}}

Maybe I have to use one line at a time in order to avoid confusion?

Thank you in advance for the explanation, it's starting to work 😊


nicola-decao commented on May 24, 2024

The best way to do disambiguation would be to use the model trained for disambiguation 😅

Look at this to see how it works: https://github.com/facebookresearch/GENRE/tree/main/examples_genre#example-entity-disambiguation


paulthemagno commented on May 24, 2024

Ah ok, so I will pass multiple sentences!

So I deduce that hints like:

mention_to_candidates_dict={
        "Leonardo": ["Leonardo Di Caprio", "Leonardo Da Vinci"],
        "Leonardo Di Caprio": ["Leonardo Di Caprio", "Leonardo Da Vinci"],
    }

are not available for the disambiguation model, are they?
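If such a dict is indeed not supported directly, a rough workaround (an assumption, not a documented GENRE API) is to derive the candidate list for each sentence from the dict by extracting the marked mention, and then feed those lists into the per-sentence tries shown earlier:

```python
import re

# Hypothetical emulation of mention_to_candidates_dict for the
# disambiguation model: extract each sentence's marked mention and
# look up its candidates, yielding one candidate list per sentence.
mention_to_candidates_dict = {
    "Leonardo": ["Leonardo Di Caprio", "Leonardo Da Vinci"],
    "Leonardo Di Caprio": ["Leonardo Di Caprio", "Leonardo Da Vinci"],
}

def candidates_per_sentence(sentences, mapping):
    out = []
    for s in sentences:
        mention = re.search(r"\[START_ENT\] (.+?) \[END_ENT\]", s).group(1)
        out.append(mapping.get(mention, []))
    return out

sentences = [
    "[START_ENT] Leonardo [END_ENT] was a painter while Leonardo Di Caprio is an actor",
    "Leonardo was a painter while [START_ENT] Leonardo Di Caprio [END_ENT] is an actor",
]
print(candidates_per_sentence(sentences, mention_to_candidates_dict))
```

The resulting lists play the role of the `candidates` in the `tries = {...}` snippets above, giving each sentence in the batch its own constrained candidate set.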

