Comments (10)

thadguidry commented on June 14, 2024

Sure! The real data is proprietary with many columns of metadata, but for my use case, I can shrink to a sample.

LEXEME_FORM, SENSE
"archer", "ninth sign of zodiac in astrology"
"artist", "skilled in art"
"drowned", "died by drowning"
"saving", "rescuing, preserving"

After reconciling the LEXEME_FORM against Forms in Wikidata, the RECONCILED_TO_FORM would be the expected exact-match Form in the "en" language. The SENSE column is one of those metadata columns; it would not be reconciled, but compared to the Senses retrieved from the parent Lexeme. For example, for the 1st row I would compare against the Senses of L29863 and then choose the matching Sense or, if no matching Sense is found, choose to create a new one.
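
A rough sketch of that "compare, then pick or create" step, in Python. Everything here is an assumption for illustration: the token-overlap score stands in for whatever NLP comparison is actually used, `choose_sense` and `threshold` are made-up names, and the sense IDs and glosses are placeholders rather than real Wikidata content.

```python
import re


def tokens(text):
    """Lowercase word tokens, used for a crude overlap comparison."""
    return set(re.findall(r"[a-z]+", text.lower()))


def choose_sense(expected_gloss, candidate_senses, threshold=0.5):
    """Return the best-matching sense ID, or None to signal "create a new Sense".

    candidate_senses maps sense ID -> gloss text (hypothetical input shape).
    The score is the Jaccard overlap of the two token sets.
    """
    wanted = tokens(expected_gloss)
    best_id, best_score = None, 0.0
    for sense_id, gloss in candidate_senses.items():
        found = tokens(gloss)
        union = wanted | found
        score = len(wanted & found) / len(union) if union else 0.0
        if score > best_score:
            best_id, best_score = sense_id, score
    return best_id if best_score >= threshold else None


# Placeholder sense IDs and glosses, NOT fetched from Wikidata.
candidates = {
    "L6357-S1": "person skilled in art",
    "L6357-S2": "performer on stage",
}
match = choose_sense("skilled in art", candidates)
missing = choose_sense(
    "ninth sign of zodiac in astrology",
    {"L29863-S1": "person who shoots with a bow"},
)
```

Here `match` is the ID of the overlapping gloss, while `missing` is `None`, which is the "Sense not found, create a new one" branch.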

I put the reconciled ID into a new column so it's easier for you to build a test case against or reformat.
"saving" and "archer" are the outlier cases, as you'll see when querying, and are part of the NLP discovery mechanisms for "missing Sense in Wikidata". :-)

LEXEME_FORM, SENSE, RECONCILED_TO_FORM
"archer", "ninth sign of zodiac in astrology", "L29863-F1"
"artist", "skilled in art", "L6357-F1"
"drowned", "died by drowning", "L12156-F3"
"saving", "rescuing, preserving", "L42004-F1"

from openrefine-wikibase.

belett commented on June 14, 2024

@thadguidry's example is very good, but here is a different and maybe simpler example (could be useful for a first degraded prototype?).

Reconcile on lemmata alone.
All lexemes have at least one lemma (a few have several), stored in RDF with wikibase:lemma. So, for instance, a simple first step is to reconcile:

  • first -> L2, L46028, L333590
  • magic -> L3, L338238, L587694
    (I chose on purpose examples with multiple homographs, i.e. identical strings as lemma, as possible values)

If we can then use these reconciled values to gather more info (mainly the language of the lexeme, dct:language, and the lexical category, wikibase:lexicalCategory, both of which are always present exactly once on Lexemes), we could distinguish homographs. That alone could already be very useful, for instance to add a dictionary identifier or a grammatical gender (P5185); we know, for example, that almost all French nouns ending in -ment are masculine.

Having the senses (ontolex:sense) and the forms (ontolex:lexicalForm) would be ideal, but that could wait for a later step on the roadmap.
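
A minimal sketch of the lemma-only approach, in Python. It only builds a SPARQL query string from the three predicates named above and then narrows homograph candidates locally; `lemma_query` and `narrow` are hypothetical helper names, and the language and category values in the sample rows are placeholders, not the real statements on L2, L46028 or L333590.

```python
import re


def lemma_query(lemma, lang_code):
    """Build a SPARQL query listing lexemes whose lemma is exactly `lemma`."""
    if not re.fullmatch(r"[A-Za-z]+(-[A-Za-z0-9]+)*", lang_code):
        raise ValueError(f"unexpected language code: {lang_code!r}")
    # Escape backslashes and double quotes for a SPARQL string literal.
    escaped = lemma.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?lexeme ?language ?category WHERE {\n"
        f'  ?lexeme wikibase:lemma "{escaped}"@{lang_code} ;\n'
        "          dct:language ?language ;\n"
        "          wikibase:lexicalCategory ?category .\n"
        "}"
    )


def narrow(candidates, language=None, category=None):
    """Keep only the homographs matching the given language and/or category."""
    return [
        c for c in candidates
        if (language is None or c["language"] == language)
        and (category is None or c["category"] == category)
    ]


query = lemma_query("first", "en")

# Placeholder language/category values, for illustration only.
rows = [
    {"lexeme": "L2", "language": "LANG_A", "category": "CAT_X"},
    {"lexeme": "L46028", "language": "LANG_A", "category": "CAT_Y"},
    {"lexeme": "L333590", "language": "LANG_B", "category": "CAT_X"},
]
unique = narrow(rows, language="LANG_A", category="CAT_X")
```

With both constraints supplied, the three homographs collapse to a single candidate, which is the disambiguation described above.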

wetneb commented on June 14, 2024

Related: #42

thadguidry commented on June 14, 2024

Hi @wetneb

I'm not currently interested in writing new lexemes to Wikidata, but only reconciling for now.
I have a column of words and want to constrain reconciling to only the Lexeme namespace.

  1. How quickly could Lexeme lookup be added for this issue? Days, Weeks, Months?

  2. If it's Months, what's involved in changing it quickly for my use case? Could the Wikidata reconcile endpoint be changed for this? Or could I get this quickly by running a Docker instance with the customized reconcile endpoint I need, tweaking the Python code in the necessary places?

  3. Could I also constrain the Suggest Flyout to the Lexeme namespace by only tweaking suggest.py, or is anything else needed?
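
For context on what a lexeme-constrained lookup could look like, here is a hedged Python sketch that builds a request URL for the MediaWiki `wbsearchentities` action with its `type` parameter restricted to lexeme entity types. This is not the reconcile service's code and says nothing about what suggest.py does today; `search_url` is a made-up helper, and whether this search behaviour fits reconciliation is an open assumption.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"

# Entity types provided by the lexeme extension; items and properties are excluded.
LEXEME_TYPES = {"lexeme", "form", "sense"}


def search_url(text, language="en", entity_type="form", limit=10):
    """Build a wbsearchentities URL restricted to one lexeme entity type."""
    if entity_type not in LEXEME_TYPES:
        raise ValueError(f"not a lexeme entity type: {entity_type!r}")
    params = {
        "action": "wbsearchentities",
        "search": text,
        "language": language,
        "type": entity_type,
        "limit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


url = search_url("archer")
```

The URL is only constructed, not fetched, so the sketch runs offline.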

wetneb commented on June 14, 2024

I'd say a few days of work should be enough for the implementation itself, but I haven't really thought about how this should work from a user perspective. Should other namespaces be supported by the current reconciliation endpoint, or should it be a different endpoint altogether? Does it make sense to have fuzzy-matching for lexemes at all? How to deal with forms and senses?

I'd only be convinced that we got this right if I start thinking about doing a data import in lexemes and understand the needs from this perspective. Otherwise it's easy to have something that "supports" lexemes but doesn't actually address the needs of the community.

thadguidry commented on June 14, 2024

I'd say a few days of work should be enough for the implementation itself, but I haven't really thought about how this should work from a user perspective.

  • Great!

Should other namespaces be supported by the current reconciliation endpoint, or should it be a different endpoint altogether?

Does it make sense to have fuzzy-matching for lexemes at all? How to deal with forms and senses?

  • The UI needs to be improved to allow users to choose the right Sense if it's available, showing the description of the sense.
  • Dealing with Lexemes requires more options to be exposed to the user.

I'd only be convinced that we got this right if I start thinking about doing a data import in lexemes and understand the needs from this perspective. Otherwise it's easy to have something that "supports" lexemes but doesn't actually address the needs of the community.

  • Agreed. For my use case, I want to perform "exact matches" against the Form in a given language. But there's more to it, with options: some will care about reconciling against the Sense, and others will want to reconcile against the Form (like me, regardless of the Sense, which might not be known depending on the dataset a user is reconciling).
  • I think it makes sense to put an "[EPIC] -" prefix on this issue, since it's a lot of work and raises questions well beyond my simple use case.
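
To make "exact match against the Form in a given language" concrete, here is a small Python sketch. It assumes forms shaped like the `forms` array of a lexeme's entity JSON (an `id` plus `representations` keyed by language code); `exact_form_ids` is a hypothetical helper, and the sample list is hand-written, with only the "drowned" to L12156-F3 pairing taken from the sample data in this thread.

```python
def exact_form_ids(word, forms, lang="en"):
    """Return IDs of forms whose representation in `lang` equals `word` exactly.

    No case folding or fuzzy matching is applied, on purpose.
    """
    matches = []
    for form in forms:
        rep = form.get("representations", {}).get(lang)
        if rep is not None and rep["value"] == word:
            matches.append(form["id"])
    return matches


# Hand-written sample; the first entry is a hypothetical placeholder.
forms = [
    {"id": "L12156-F1",
     "representations": {"en": {"language": "en", "value": "drown"}}},
    {"id": "L12156-F3",
     "representations": {"en": {"language": "en", "value": "drowned"}}},
]
hits = exact_form_ids("drowned", forms)
no_hits = exact_form_ids("Drowned", forms)
```

Because the comparison is a plain string equality, "Drowned" with a capital letter yields no hit; whether that strictness is desirable is exactly the kind of option a user would need exposed.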

wetneb commented on June 14, 2024

Feel free to describe your own use case in more detail, so that we have at least one data point!

thadguidry commented on June 14, 2024

My use case is:

  • As a user, I want to reconcile a column of Lexeme Forms against Wikidata's Lexeme Forms.
    • I would then like to augment the Lexeme Forms by using the Lexeme Form IDs to query Wikidata for all of the Lexeme's Sense IDs and descriptions.
    • I will then use NLP tools to help determine the right Sense ID that I need for my rows (comparing to other columns' data, etc.), since this is harder to do directly in OpenRefine.
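
The augmentation step in this use case can be sketched in Python as follows. It assumes Form IDs of the shape `L<number>-F<number>` (as in the sample data) and a lexeme entity document whose `senses` entries carry an `id` and per-language `glosses`; `parent_lexeme` and `sense_glosses` are made-up helper names, and the sample entity is a hand-written placeholder rather than fetched Wikidata content.

```python
import re

FORM_ID = re.compile(r"^(L\d+)-F\d+$")


def parent_lexeme(form_id):
    """Derive the parent Lexeme ID from a Form ID, e.g. L29863-F1 -> L29863."""
    match = FORM_ID.match(form_id)
    if match is None:
        raise ValueError(f"not a Form ID: {form_id!r}")
    return match.group(1)


def sense_glosses(entity, lang="en"):
    """Map each Sense ID of a lexeme entity to its gloss in `lang`, if present."""
    result = {}
    for sense in entity.get("senses", []):
        gloss = sense.get("glosses", {}).get(lang)
        if gloss is not None:
            result[sense["id"]] = gloss["value"]
    return result


lexeme_id = parent_lexeme("L29863-F1")

# Placeholder entity; the gloss text is invented for illustration.
sample_entity = {
    "id": lexeme_id,
    "senses": [
        {"id": "L29863-S1",
         "glosses": {"en": {"language": "en", "value": "placeholder gloss"}}},
        {"id": "L29863-S2", "glosses": {}},
    ],
}
glosses = sense_glosses(sample_entity)
```

The resulting Sense ID to gloss mapping is what would be handed to the external NLP tooling for comparison against the other columns.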

wetneb commented on June 14, 2024

Thanks! Could you perhaps give a sample of the data you want to reconcile, and the expected results (corresponding lexemes / forms / senses)?

wetneb commented on June 14, 2024

OpenRefine has adopted the architecture of one reconciliation service per entity type, so it makes sense to keep this web service for items only. Other reconciliation services can be implemented separately.
