Comments (10)
> @cneud, yes, the issue can be solved with substitutions which can be configured by the users.
Exactly.
I aim to support canonical equivalence in Unicode terms (i.e. using NFC consistently), and maybe the same idea for MUFI characters. All further equivalence considerations should be user-configurable. That also means some of the hardcoded substitutions will move to some kind of configuration (#11), and @stweil can then make uͦ and ů equivalent.
from dinglehopper.
Basically this comes down to a number of pre-defined common use cases or scenarios, with the added possibility for users to create their own scenarios. This is also the approach that was followed during the IMPACT project and which I believe would be a sound tradeoff between comparable results and flexibility with regard to differing applications.
Dinglehopper already uses normalization (NFC). The delta in the example is caused by different characters which look the same: u + COMBINING LATIN SMALL LETTER O != u + COMBINING RING ABOVE.
So COMBINING LATIN SMALL LETTER O and COMBINING RING ABOVE should be handled as similar when comparing.
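To illustrate why NFC alone does not close this gap, here is a minimal sketch using only Python's standard `unicodedata` module (the variable names are just for illustration). NFC composes u + COMBINING RING ABOVE into the precomposed ů, but u + COMBINING LATIN SMALL LETTER O has no precomposed form and stays as two code points, so the strings still differ after normalization.

```python
import unicodedata

ring = "u\u030a"     # u + COMBINING RING ABOVE
small_o = "u\u0366"  # u + COMBINING LATIN SMALL LETTER O

nfc_ring = unicodedata.normalize("NFC", ring)
nfc_small_o = unicodedata.normalize("NFC", small_o)

# NFC composes the ring variant into a single code point ...
print([unicodedata.name(c) for c in nfc_ring])
# ['LATIN SMALL LETTER U WITH RING ABOVE']

# ... but leaves the small-o variant as base letter + combining mark.
print([unicodedata.name(c) for c in nfc_small_o])
# ['LATIN SMALL LETTER U', 'COMBINING LATIN SMALL LETTER O']

print(nfc_ring == nfc_small_o)  # False
```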
I do think that COMBINING LATIN SMALL LETTER O and COMBINING RING ABOVE are entirely different characters and not equivalent.
However, I do think that the normalization behaviour should be configurable by the user. So if you choose to consider these characters to be the same for your use case, then you should be able to configure this. I have some work in progress on this that I still need to merge and develop further.
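A rough sketch of what such user-configurable equivalences could look like, assuming a plain dictionary of substitutions applied after NFC. The function `apply_equivalences` and the `user_substitutions` mapping are hypothetical names for illustration only, not dinglehopper's actual API or configuration format.

```python
import unicodedata


def apply_equivalences(text, substitutions=None):
    """NFC-normalize, then apply optional user-supplied substitutions.

    `substitutions` is a hypothetical {source: target} mapping; this is
    an illustration, not dinglehopper's real configuration mechanism.
    """
    text = unicodedata.normalize("NFC", text)
    for source, target in (substitutions or {}).items():
        source = unicodedata.normalize("NFC", source)
        target = unicodedata.normalize("NFC", target)
        text = text.replace(source, target)
    # Substitutions may create new composable sequences, so normalize again.
    return unicodedata.normalize("NFC", text)


# A user who wants to treat u + COMBINING LATIN SMALL LETTER O like ů:
user_substitutions = {"u\u0366": "\u016f"}

gt = "Bu\u0366ch"   # ground truth with combining small o
ocr = "B\u016fch"   # OCR output with precomposed ů

print(apply_equivalences(gt) == apply_equivalences(ocr))  # False
print(apply_equivalences(gt, user_substitutions)
      == apply_equivalences(ocr, user_substitutions))     # True
```

Without a mapping the two strings stay distinct, which matches the default position that these are different characters.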
Duplicate of #11?
@cneud, yes, the issue can be solved with substitutions which can be configured by the users.
@mikegerber, sure, uͦ and ů are different characters, maybe like some Cyrillic characters which look like Latin ones. But which of them is the right one for historic German texts? Do you agree that it does not make sense to train both for OCR models in that context?
> @cneud, yes, the issue can be solved with substitutions which can be configured by the users.
>
> Exactly.
I would like to point out here that allowing arbitrary equivalences also makes comparing much more difficult. Ideally, there should be sensible sets of transformations (like OCR-D GT levels) which many researchers/practitioners can agree on. And then, to facilitate commensurability, ideally, the evaluation should produce multiple metrics next to each other in the report (like: always the chosen metric plus maximum normalization (GT level 1) metric plus minimum (GT level 3 / Levenshtein) metric).
Also, cf. existing metrics in my module.
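As a sketch of the "multiple metrics next to each other" idea, the snippet below reports a character error rate under three normalization profiles for the same text pair. The profiles are illustrative stand-ins chosen for this example and are not definitions of OCR-D GT levels; `PROFILES`, `cer_report` and the helper functions are hypothetical names, and the CER here counts plain code points.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (item_a != item_b),  # substitution
            ))
        previous = current
    return previous[-1]


def nfc(text):
    return unicodedata.normalize("NFC", text)


def strip_marks(text):
    """Aggressive stand-in profile: drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Illustrative profiles only; NOT the OCR-D GT level definitions.
PROFILES = {
    "minimal (NFC only)": nfc,
    "chosen (NFC + one substitution)":
        lambda t: nfc(t).replace("u\u0366", "\u016f"),
    "maximal (combining marks stripped)": strip_marks,
}


def cer_report(gt, ocr):
    """Return {profile name: CER}, with CER = edit distance / GT length."""
    report = {}
    for name, transform in PROFILES.items():
        g, o = transform(gt), transform(ocr)
        report[name] = levenshtein(g, o) / len(g) if g else 0.0
    return report


report = cer_report("Bu\u0366ch", "B\u016fch")
for name, cer in report.items():
    print(f"{name}: {cer:.3f}")
# minimal (NFC only): 0.400
# chosen (NFC + one substitution): 0.000
# maximal (combining marks stripped): 0.000
```

Printing the strictest and the most lenient figure alongside the chosen one makes it visible how much of a reported score depends on the normalization.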
I agree mostly with what @bertsky and @cneud said. I just want to throw in some doubt on the belief that CERs are somehow comparable when produced by different tools. Do they count whitespace the same way? grapheme clusters? punctuation?
Side note: Is there really a set of transformations defined for OCR-D's GT level 1?
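To make the comparability doubt above concrete, here is a small sketch showing that the same text pair yields different CERs depending on whether code points or grapheme clusters are counted. The cluster splitting is a deliberately crude approximation (attach combining marks to the preceding character) and not a full implementation of Unicode text segmentation; all function names are hypothetical.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,
                current[j - 1] + 1,
                previous[j - 1] + (item_a != item_b),
            ))
        previous = current
    return previous[-1]


def code_points(text):
    return list(text)


def approx_clusters(text):
    """Crude approximation of grapheme clusters: glue combining marks
    onto the preceding character. Real segmentation has more rules."""
    clusters = []
    for char in text:
        if clusters and unicodedata.combining(char):
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters


def cer(gt, ocr, units):
    g, o = units(gt), units(ocr)
    return levenshtein(g, o) / len(g)


gt = "Bu\u0366ch"  # u + COMBINING LATIN SMALL LETTER O
ocr = "Buch"       # the OCR dropped the combining mark

cer_cp = cer(gt, ocr, code_points)      # 1 deletion over 5 code points
cer_gc = cer(gt, ocr, approx_clusters)  # 1 substitution over 4 clusters

print(cer_cp)  # 0.2
print(cer_gc)  # 0.25
```

The same single OCR error is reported as 20 % or 25 % depending on the counting unit alone, before whitespace or punctuation handling even comes into play.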
> the belief that CERs are somehow comparable when produced by different tools
I too strongly doubt they are! Looking at e.g. results and metrics from ICDAR papers, many resort to their own implementation for evaluation, which obviously creates considerable blur around any exact performance comparison.
> I just want to throw in some doubt on the belief that CERs are somehow comparable when produced by different tools. Do they count whitespace the same way? grapheme clusters? punctuation?
We have to get there! As a community. Otherwise, where's the objectivity?
Whitespace should be easy, given the strictly implicit PAGE-XML whitespace model. Grapheme clusters are something we agreed on earlier, only our implementations differ (so it should be interesting to compare them to find edge cases). Punctuation, not sure what you mean – punctuation normalization in CER, or tokenization for WER? (The latter I agree is a hard one to find any one standard for...)
> Side note: Is there really a set of transformations defined for OCR-D's GT level 1?
No, unfortunately not. That's one of the things I have been adamant about in phase 2 but never got @tboenig or @kba to implement a runnable definition 😟