Comments (15)
Ah yeah, these are all interesting ambiguities: a colon at the beginning of a line is now used both for whitespace in our CG format and for suggestions, the comma in the second field is used to separate suggestions (though probably not empty strings), and spaces in lemmas make field counting a bit ambiguous.
So the current workarounds are a bit hacky.
from giella-core.
The problem with compounds seems to be when one analysis renders a "?", while the other renders the expected output:
:\n
"<Gållelibjes>"
"libjes" N Sem/Dummytag Sg Nom
"gålle" N Sem/Mat Cmp/SgGen Cmp
gålle+N+Cmp/SgGen+Cmp#libjes+N+Sg+Nom ?
"libjes" N Sem/Dummytag Sg Nom
"gålle" N Sem/Mat Cmp/SgNom Cmp
gålle+N+Cmp/SgNom+Cmp#libjes+N+Sg+Nom Gållelibjes
"<:>"
":" CLB
:+CLB :
:
This leads to the following final text:
gålle+N+Cmp/SgGen+Cmp#libjes+N+Sg+NomGållelibjes: Oddvar Hansen, Otto Kristian Løvik, Lill Hege Nilsen, Jørn Øverby ja Erik Martinsen Øvergaard.
Instead of the expected:
Gållelibjes: Oddvar Hansen, Otto Kristian Løvik, Lill Hege Nilsen, Jørn Øverby ja Erik Martinsen Øvergaard.
from giella-core.
Here's an extreme case of the duplicate compound bug:
Bjørn Olav Megard le låhkåm cand.polit. sosialantropologijja oajvvefágajn Oslo universitehtas.
is turned into:
Bjørn Olav Megard le låhkåm cand.polit.cand.polit socialantropologidjasocialantropologiddjasocialantropologidjasocialantropologiddja oajvvefágajn Oslo universitehtas.
😀
It seems that in this case none of the generated forms is identical to the input form (because the input is a NO form, and all output forms are SE forms), and the output forms differ among themselves. In such a case we just go for the first one; we don't have the time to do anything more advanced.
from giella-core.
We might need a more advanced parser, or a fix on the divvun-suggest side, for some of these cases where there are multiple different results, before the awk becomes too unwieldy... As a quick patch, maybe vislcg's -1 option for picking 1-random to tie break is good enough?
from giella-core.
The vislcg solution will only fix things on the analysis side. The problem is that the generation of the new word forms is ambiguous, and I don't know how to fix this quickly. Adding an option to divvun-suggest to unique the output would help, but is not enough. A warning on stderr about unresolved ambiguity would also help; there are not that many such cases.
Any suggestion welcome 🙂
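A minimal sketch of the two ideas above (uniquing the generated forms and warning on stderr when ambiguity remains), written in Python rather than awk. This is not how divvun-suggest or the existing script works; the function names `unique_forms` and `resolve` are hypothetical, and the sample forms are the ones from the sosialantropologijja example earlier in this thread.

```python
import sys


def unique_forms(forms):
    """Drop duplicate generated forms, keeping first-seen order."""
    return list(dict.fromkeys(forms))


def resolve(input_form, forms, log=None):
    """Return one form for a cohort; warn if several distinct forms remain.

    Hypothetical helper: falls back to the input form when nothing was
    generated, otherwise takes the first distinct generated form.
    """
    log = sys.stderr if log is None else log
    distinct = unique_forms(forms)
    if len(distinct) > 1:
        print(
            f"warning: unresolved ambiguity for {input_form!r}: {distinct}",
            file=log,
        )
    return distinct[0] if distinct else input_form


if __name__ == "__main__":
    generated = [
        "socialantropologidja",
        "socialantropologiddja",
        "socialantropologidja",
        "socialantropologiddja",
    ]
    print(resolve("sosialantropologijja", generated))
```

Uniquing alone removes the repeated forms but still leaves two candidates here, which is why the warning is emitted separately from the choice of form.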
from giella-core.
Actually, it seems to be only two cases left:
- cand.polit., where the output matches the input, but is probably confused by the final full stop; an adjustment of the existing code should be enough
- sociala..., where none of the outputs match the input; as a last resort just go with what is at hand / is easiest
Since I don't know awk, I have no idea how hard this would be.
from giella-core.
I actually couldn't get cg-proc -1 to work anyways 😊
It's maybe less spaghetti than it might have been, but the code now tracks roughly one output per cohort, which is not the best; to actually control it, CG rules might be the way to go.
For cand.polit. the variants are, I think, with and without a full stop, and maybe one with :v at the end, if I read correctly?
from giella-core.
With the changes in 63adeef it is almost perfect. What is still left are some cases where the awk script does not choose the word form that is identical to the input form even when there is such a word form. Example input:
Jus muhtema mielas nágin le vajáldahtedum, de dåhkki ájn ienep ulmutjijt libjjáj oajvvadit.
Earlier output (with multiple forms concatenated):
Jus muhtemuhtema mielas nágin le vajáldahtedum, de dåhkki ájn ienep ulmutjijt libjjáj oajvvadit.
New output after the latest fix:
Jus muhte mielas nágin le vajáldahtedum, de dåhkki ájn ienep ulmutjijt libjjáj oajvvadit.
Expected: since the input form is muhtema, and that word form is among the generated word forms, I had expected that one to be selected.
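The expected behaviour can be sketched as follows, in Python rather than the awk actually used. This is an illustration only: `pick_form` is a hypothetical name, and the list of generated forms is inferred from the concatenated output shown above (muhte + muhtema).

```python
def pick_form(input_form, generated):
    """Prefer the generated form identical to the input, else the first one.

    Hypothetical helper; if nothing was generated the input form is kept.
    """
    if not generated:
        return input_form
    if input_form in generated:
        return input_form
    return generated[0]


if __name__ == "__main__":
    # The input form is among the generated forms, so it is selected.
    print(pick_form("muhtema", ["muhte", "muhtema"]))
```

When no generated form matches the input (as in the NO/SE spelling case earlier in the thread), this falls back to the first generated form, which is the last-resort behaviour already described.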
from giella-core.
Another error:
Äládus- ja oasesdepartementaoasesdepartemænnta
The æ in "Æládus" has become ä, but not in the last part of the compound, "departemænnta".
from giella-core.
Gehtjav makkár {åvddånbuktemvuohke} l gåvån = Gehtjav makkár {} l gåvån
åvddånbuktemvuohke is err/cmp.
from giella-core.
Anne Silja l aj åvdep giese journalisstan barggam {NRK} Sámeradio åvdås = Anne Silja l aj åvdep giese journalisstan barggam {NRK:A} Sámeradio åvdås
from giella-core.
Dán jagásj Bårjjåsin li guokta {vuostasj} artihkkala = Dán jagásj Bårjjåsin li guokta {vuostas} artihkkala
from giella-core.
Ja {gájkka} galggá sámegiellaj dáhpáduvvat = Ja {gájkav} galggá sámegiellaj dáhpáduvvat
from giella-core.
In retrospect, the processing should have been different:
- grab original word form by default
- grab (first) generated word form only when cohort contains the target tag
I.e., keep everything as original unless there is a reason to do otherwise.
We might still want to do this in future work for other languages and contexts.
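A rough sketch of that alternative processing, in Python. Everything here is an assumption made for illustration: the `Cohort` class, the `render` function and the tag name "TARGET" are hypothetical and do not reflect the real CG stream format or the actual tag used in the pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    """Hypothetical, simplified cohort: original form, tags, generated forms."""
    wordform: str
    tags: set = field(default_factory=set)
    generated: list = field(default_factory=list)


def render(cohort, target_tag):
    """Keep the original word form unless the cohort carries the target tag."""
    if target_tag in cohort.tags and cohort.generated:
        return cohort.generated[0]  # first generated form only
    return cohort.wordform


if __name__ == "__main__":
    cohorts = [
        # No target tag: the original form is kept even though a form was generated.
        Cohort("vuostasj", {"A"}, ["vuostas"]),
        # Target tag present: the first generated form replaces the original.
        Cohort("sosialantropologijja", {"N", "TARGET"},
               ["socialantropologidja", "socialantropologiddja"]),
    ]
    print(" ".join(render(c, "TARGET") for c in cohorts))
```

With this default, cases such as {vuostasj} or {NRK} above would pass through unchanged, since their cohorts would not carry the target tag.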
from giella-core.
Mm, a long-term solution should definitely go into one of the applications that can process whole cohorts and sentences with all information intact. But this prototype will be very useful for design; we should collect and categorise the problems to keep them in mind.
from giella-core.