Comments (10)
Sure! The real data is proprietary with many columns of metadata, but for my use case, I can shrink to a sample.
LEXEME_FORM, SENSE
"archer", "ninth sign of zodiac in astrology"
"artist", "skilled in art"
"drowned", "died by drowning"
"saving", "rescuing, preserving"
After reconciling the LEXEME_FORM against Forms in Wikidata, the RECONCILED_TO_FORM would be the expected exact-match Form in the "en" language. The SENSE column is one of those metadata columns; it would not be reconciled, but compared to the Senses retrieved from the parent Lexeme. For example, for the 1st row, I would compare against the Senses of L29863 and then choose the matching Sense or, if no matching Sense is found, choose to create a new one.
I put the reconciled id into a new column so it's easier for you to build a test case against or reformat.
"saving" and "archer" are the outlier cases, as you'll see when querying, and are part of the NLP discovery mechanisms for "missing Sense in Wikidata". :-)
LEXEME_FORM, SENSE, RECONCILED_TO_FORM
"archer", "ninth sign of zodiac in astrology", "L29863-F1"
"artist", "skilled in art", "L6357-F1"
"drowned", "died by drowning", "L12156-F3"
"saving", "rescuing, preserving", "L42004-F1"
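For building a test case, the sample above can be loaded roughly as follows. This is only a sketch: the helper names `parse_rows` and `parent_lexeme` are made up for illustration, and it assumes Form IDs always have the shape `L<number>-F<number>`, as in the rows above.

```python
import csv
import io

# The sample rows from this comment, copied verbatim.
SAMPLE = """\
LEXEME_FORM, SENSE, RECONCILED_TO_FORM
"archer", "ninth sign of zodiac in astrology", "L29863-F1"
"artist", "skilled in art", "L6357-F1"
"drowned", "died by drowning", "L12156-F3"
"saving", "rescuing, preserving", "L42004-F1"
"""


def parse_rows(text):
    """Parse the sample into dicts keyed by column name (hypothetical helper)."""
    # skipinitialspace lets the parser accept the space after each comma,
    # so the quoted fields (including the one containing a comma) parse cleanly.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [{key.strip(): value.strip() for key, value in row.items()} for row in reader]


def parent_lexeme(form_id):
    """Return the Lexeme ID that a Form ID belongs to, e.g. 'L29863-F1' -> 'L29863'."""
    lexeme_id, sep, form_part = form_id.partition("-")
    if not sep or not lexeme_id.startswith("L") or not form_part.startswith("F"):
        raise ValueError(f"not a Form ID: {form_id!r}")
    return lexeme_id


rows = parse_rows(SAMPLE)
for row in rows:
    print(row["LEXEME_FORM"], "->", parent_lexeme(row["RECONCILED_TO_FORM"]))
```

The parent Lexeme ID is what would be needed to look up the Senses to compare the SENSE column against.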
from openrefine-wikibase.
@thadguidry's example is very good, but here is a different and maybe simpler example (could be useful for a first degraded prototype?).
Reconcile on lemmata alone.
All lexemes have at least one lemma (a few have several), stored in RDF with wikibase:lemma. So, for instance, a simple first step is to reconcile:
- first -> L2, L46028, L333590
- magic -> L3, L338238, L587694
(I deliberately chose examples with multiple homographs, i.e. identical strings as lemma, as possible values.)
If we can then use these reconciled values to gather more info (mainly the language of the lexeme, dct:language, and the lexical category, wikibase:lexicalCategory, each of which is always present exactly once on a Lexeme), we could distinguish homographs, and that could already be very useful (to add a dictionary identifier, for instance, or to add a grammatical gender, P5185; for instance, we know that all French nouns ending in -ment are masculine).
Having the senses (ontolex:sense) and the forms (ontolex:lexicalForm) would be ideal, but that could wait for a later step on the roadmap.
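As a rough illustration of the lemma-only step, here is a sketch that builds a SPARQL query from the three predicates named above. The function name `lemma_query` is made up, and it assumes the query is sent to an endpoint where the wikibase: and dct: prefixes are already declared (as I understand the Wikidata Query Service does) and that lemmas are stored as language-tagged literals.

```python
import re


def lemma_query(lemma, lang="en"):
    """Build a SPARQL query finding lexemes whose lemma is exactly `lemma` (hypothetical helper).

    Returns the query text only; sending it to an endpoint is left to the caller.
    """
    # Refuse anything that does not look like a language tag, since it is
    # interpolated into the query unquoted.
    if not re.fullmatch(r"[A-Za-z]{2,3}(-[A-Za-z0-9]+)*", lang):
        raise ValueError(f"unexpected language tag: {lang!r}")
    # Escape backslashes and double quotes so the lemma stays a single string literal.
    escaped = lemma.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?lexeme ?language ?category WHERE {\n"
        f'  ?lexeme wikibase:lemma "{escaped}"@{lang} ;\n'
        "          dct:language ?language ;\n"
        "          wikibase:lexicalCategory ?category .\n"
        "}"
    )


print(lemma_query("first"))
```

Returning the language and lexical category alongside each lexeme is what would let a user tell homographs apart, such as the several candidates listed for "first" and "magic".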
from openrefine-wikibase.
Related: #42
from openrefine-wikibase.
Hi @wetneb
I'm not currently interested in writing new lexemes to Wikidata, but only reconciling for now.
I have a column of words and want to constrain reconciling to only the Lexeme namespace.
- How quickly could Lexeme lookup be added for this issue? Days, Weeks, Months?
- If it's Months, what's involved in changing it quickly for my use case? Could the Wikidata reconcile endpoint be changed for this? Or could I just get this quickly by running a Docker instance with the customized reconcile endpoint I need, tweaking the Python code in the necessary places?
- Could I also constrain the Suggest Flyout to the Lexeme namespace by only tweaking suggest.py, or is anything else needed?
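One possible way to restrict a lookup to lexemes, sketched under assumptions: as far as I know, the MediaWiki `wbsearchentities` action on Wikidata accepts `lexeme`, `form` and `sense` as values of its `type` parameter. The helper name `search_url` is invented here, and whether suggest.py can be adapted this way has not been checked against the actual code.

```python
from urllib.parse import urlencode

# Public Wikidata API endpoint.
API = "https://www.wikidata.org/w/api.php"

# Entity types from the lexeme data model (assumed to be accepted by the API).
LEXEME_TYPES = {"lexeme", "form", "sense"}


def search_url(text, entity_type="lexeme", language="en", limit=10):
    """Build a wbsearchentities URL restricted to one lexeme-related entity type."""
    if entity_type not in LEXEME_TYPES:
        raise ValueError(f"unsupported entity type: {entity_type!r}")
    params = {
        "action": "wbsearchentities",
        "search": text,
        "language": language,
        "type": entity_type,
        "limit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)


print(search_url("drowned", entity_type="form"))
```

This only builds the request URL; the search is prefix-based rather than an exact match, so results would still need filtering for the exact-match use case.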
from openrefine-wikibase.
I'd say a few days of work should be enough for the implementation itself, but I haven't really thought about how this should work from a user perspective. Should other namespaces be supported by the current reconciliation endpoint, or should it be a different endpoint altogether? Does it make sense to have fuzzy-matching for lexemes at all? How to deal with forms and senses?
I'd only be convinced that we got this right if I start thinking about doing a data import in lexemes and understand the needs from this perspective. Otherwise it's easy to have something that "supports" lexemes but doesn't actually address the needs of the community.
from openrefine-wikibase.
I'd say a few days of work should be enough for the implementation itself, but I haven't really thought about how this should work from a user perspective.
- Great!
Should other namespaces be supported by the current reconciliation endpoint, or should it be a different endpoint altogether?
Does it make sense to have fuzzy-matching for lexemes at all? How to deal with forms and senses?
- The UI needs to be improved to allow users to choose the right Sense if it's available, showing the description of the sense.
- dealing with Lexemes requires more options exposed to a user.
I'd only be convinced that we got this right if I start thinking about doing a data import in lexemes and understand the needs from this perspective. Otherwise it's easy to have something that "supports" lexemes but doesn't actually address the needs of the community.
- Agreed. For my use case, I want to perform "exact matches" against the Form in a given language. But there are more options to consider: some will care about reconciling against the Sense, and others will want to reconcile against the Form (like me, regardless of the Sense, which might not be known depending on the dataset a user is reconciling).
- I think it makes sense to put an "[EPIC] -" prefix on this issue, since it's a lot of work and raises questions well beyond my simple use case.
from openrefine-wikibase.
Feel free to describe your own use case in more detail, so that we have at least one data point!
from openrefine-wikibase.
My use case is:
- As a user, I want to reconcile a column of Lexeme Forms against Wikidata's Lexeme Forms.
- I would then like to augment the Lexeme Forms by using the Lexeme Form IDs to query Wikidata for all of the Lexeme's Sense IDs and Descriptions.
- I will then use NLP tools to help determine the right Sense ID that I need for my rows (comparing to other columns' data, etc.), since this is harder to do directly in OpenRefine.
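The last two steps could look roughly like this. It is a sketch only: it assumes the lexeme JSON has a "senses" list whose entries carry an "id" and per-language "glosses" (my understanding of the Wikibase lexeme JSON), the word-overlap score is a deliberately naive stand-in for real NLP tooling, and the entity below uses the placeholder ID "L0" with invented glosses rather than real Wikidata content.

```python
def sense_glosses(lexeme_entity, language="en"):
    """Map Sense ID -> gloss text for one lexeme's JSON (assumed structure)."""
    glosses = {}
    for sense in lexeme_entity.get("senses", []):
        gloss = sense.get("glosses", {}).get(language)
        if gloss:
            glosses[sense["id"]] = gloss["value"]
    return glosses


def best_sense(target, glosses):
    """Return the Sense ID whose gloss shares the most words with `target`.

    Uses Jaccard word overlap; returns None when nothing overlaps, which
    would flag the row as a candidate for a new Sense.
    """
    target_words = set(target.lower().split())
    best_id, best_score = None, 0.0
    for sense_id, gloss in glosses.items():
        words = set(gloss.lower().split())
        union = target_words | words
        score = len(target_words & words) / len(union) if union else 0.0
        if score > best_score:
            best_id, best_score = sense_id, score
    return best_id


# Placeholder entity: the ID and glosses are invented for illustration.
example_entity = {
    "id": "L0",
    "senses": [
        {"id": "L0-S1", "glosses": {"en": {"language": "en", "value": "a person who creates art"}}},
        {"id": "L0-S2", "glosses": {"en": {"language": "en", "value": "someone skilled in art"}}},
    ],
}

glosses = sense_glosses(example_entity)
print(best_sense("skilled in art", glosses))
print(best_sense("ninth sign of zodiac", glosses))
```

A row with no overlapping gloss comes back as None, which corresponds to the "missing Sense in Wikidata" case.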
from openrefine-wikibase.
Thanks! Could you perhaps give a sample of the data you want to reconcile, and the expected results (corresponding lexemes / forms / senses)?
from openrefine-wikibase.
OpenRefine has adopted the architecture of one reconciliation service per entity type, so it makes sense to keep this web service for items only. Other reconciliation services can be implemented separately.
from openrefine-wikibase.