Git Product home page Git Product logo

Comments (17)

hickst avatar hickst commented on August 30, 2024

Isn't a gazetteer pretty much what the KB lookup tables and the lookup routines are?

from processors.

myedibleenso avatar myedibleenso commented on August 30, 2024

What we're aiming for is something more like any entry in file X would be assigned the label Y. So we could detect everything in our Uniprot KB and assign it the label Protein, but grounding those would have to be something separate.

From the gazetteer we'd build an efficient state machine for matching.

We haven't discussed the syntax this would use, but perhaps something like this:

name: "gazetteer-example"
gazetteer: "pastas.txt"
type: token
label: Pasta
pattern: |
  [gazetteer=/./]+ [word=pasta]?

where the contents of pastas.txt would look like:

linguini
angel hair
ravioli
...
gnocchi

from processors.

hickst avatar hickst commented on August 30, 2024

Nice example. But, if you're really looking up Proteins, Gene, Families, etc. then it's worth pointing out that these are already in memory in an efficient data structure (a Map). So an initial quick-and-dirty gazetteer function could be an existance test for the lookup value in the key set of the appropriate KB.

If you will create a more general purpose gazetteer and all you need is an existance test of the value, then I suggest a Trie structure for each gazetteer; a compact representation with quick, efficient lookup and no need for state-machine machinery.

from processors.

myedibleenso avatar myedibleenso commented on August 30, 2024

This issue is for a general purpose enhancement to Odin (gazetteer-based matchers). It may never be used in REACH.

As far as the implementation, we were considering a trie but we are still undecided given some of our requirements. Thanks for the suggestion though.

from processors.

MihaiSurdeanu avatar MihaiSurdeanu commented on August 30, 2024

The RuleNER already uses a trie that I optimized for our purposes. See
e.a.s.struct.HashTrie.
And yes, it would be nice to have Odin support for this.

On Thu, Jun 16, 2016 at 6:20 AM, Gus Hahn-Powell [email protected]
wrote:

This issue is for a general purpose enhancement to Odin (gazetteer-based
matchers). It may never be used in REACH.

As far as the implementation, we were considering a trie but we are still
undecided given some of our requirements. Thanks for the suggestion though.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#66 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/ABH-zrwq5ABiUtkqN0_7limzknT4sYZhks5qMMDvgaJpZM4I25i1
.

from processors.

marcovzla avatar marcovzla commented on August 30, 2024

Maybe we should explore algorithms like Aho–Corasick, Commentz-Walter, and Rabin–Karp that are designed for multiple string matching. It may be worthwhile since we will be doing this a lot.

There is also http://www.tgries.de/agrep/doc/agrep2ps.zip

from processors.

hickst avatar hickst commented on August 30, 2024

From Gus's example I assumed that we were just doing a lookup of a single literal string. If that's the case, the Trie isn't doing multiple string matching and should be very fast. If I remember right, worst case on the order of O(n) where n is the string length and on average something like logK(avg string length) where K is the average branching factor (related to the size of the character set in use).

Anyway, it sounds like a thing that would be assigned to the programmer so that you guys can be freed up to do research. ;)

from processors.

marcovzla avatar marcovzla commented on August 30, 2024

The idea is to search for all occurrence of all the patterns in the gazetteer inside the document.
The same gazetteer would be applied over multiple documents. This has been studied a lot and I believe we should at least have a reason to reject those algorithms. I believe that Aho-Corasick uses a trie internally, but there is more to it than just that. And if we settle with just using a trie, wouldn't a dawg be more memory efficient?

from processors.

marcovzla avatar marcovzla commented on August 30, 2024

dawg is more commonly referred to as dafsa: https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton

from processors.

hickst avatar hickst commented on August 30, 2024

Thanks for the clarification: I misunderstood the intention of the new gazetteer as a simple lookup. I see now why those other algorithms would be relevant and important to investigate.

from processors.

hickst avatar hickst commented on August 30, 2024

Re DAWG: yes, that makes sense:
"By allowing the same vertices to be reached by multiple paths, a DAFSA may use significantly fewer vertices than the strongly related trie data structure."

from processors.

marcovzla avatar marcovzla commented on August 30, 2024

This is a nice algorithm for constructing the DAFSA incrementally.
http://www.aclweb.org/anthology/J00-1002

from processors.

myedibleenso avatar myedibleenso commented on August 30, 2024

Based on discussions at the event in Portland today, I feel this is actually very important for getting started with new domains. We will implement an initial version over the next few weeks.

I am going to go ahead and impose a deadline for the first version as the end of June. Notice it has been nearly a year since we last had activity on this issue!

from processors.

myedibleenso avatar myedibleenso commented on August 30, 2024

Implementation said, what should usage of a gazetteer look like? Previously I introduced this example:

name: "gazetteer-example"
gazetteer: "pastas.txt"
type: token
label: Pasta
pattern: |
  [gazetteer=/./]+ [word=pasta]?

If the use case is simply to see if a token is in the specified gazetteer, I'm not sure [gazetteer=/./] is the way to go. Something like[gazetteer=/^ling/] can be handled with an additional token constraint (ex. & lemma=/^ling/). I think we really just need to assert membership, but [gazetteer=true] seems weird to me.

Here is another idea:

resources:
  - gazetteers:
     # gazetteer name -> path/to/resource
     pastas: path/to/pastas.txt
     cities: path/to/cities.txt
     states: path/to/states.txt

rules:
  - name: "gazetteer-example-1"
    label: Pasta
    type: token
    pattern: |
      [gazetteer(pastas) & lemma=/^lin/] 
      [word=pasta]?

  - name: "gazetteer-example-2"
    label: Location
    type: token
    pattern: |
      [gazetteer(cities) | gazetteer(states)]+

Thoughts?

from processors.

myedibleenso avatar myedibleenso commented on August 30, 2024

I assume the gazetteer should match against a token's word attribute.

Should gazetteer-based matching be case sensitive or not?

from processors.

MihaiSurdeanu avatar MihaiSurdeanu commented on August 30, 2024

@myedibleenso: We are now closer to this, due to the new LexiconNER class. We need to plug this in Odin though.

from processors.

kwalcock avatar kwalcock commented on August 30, 2024

Did this ever happen?

from processors.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.