Comments (6)
@JKamlah I am going to implement it according to the PAGE specs, i.e. "take the TextEquiv with the lowest index
(if there are multiple)". This seems to also be correct for your example. (Your code at https://github.com/JKamlah/dinglehopper selects by a user-specified index). Do you see a problem with that I might be missing?
from dinglehopper.
In this example (thanks to @JKamlah!), OCR-D-GT_0008.xml contains corrections in the TextEquivs with the lowest index:
larex-indexed-textequiv-jkamlah.zip
<TextLine id="l2">
<Coords points="301,270 1389,270 1389,306 301,306"/>
<TextEquiv index="0">
<Unicode>
sondere Schrift daraus zu machen. Locke scheint fort-
</Unicode>
</TextEquiv>
<TextEquiv index="1">
<Unicode>
gondere Schrift daraus zu machen. LDocke scheint fortโ-
</Unicode>
</TextEquiv>
</TextLine>
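The "lowest index wins" rule can be sketched with the standard library alone. This is only an illustration, not dinglehopper's actual implementation: the function name `lowest_index_text` is hypothetical, the sample is the line above with one garbled character dropped, and treating a `TextEquiv` without an `index` attribute as lowest priority is an assumption. The `{*}` wildcard needs Python 3.8 or later and lets the same code work on namespaced PAGE files.

```python
import xml.etree.ElementTree as ET

# Sample line from the thread (whitespace condensed, one garbled character dropped).
SAMPLE = """
<TextLine id="l2">
  <Coords points="301,270 1389,270 1389,306 301,306"/>
  <TextEquiv index="0">
    <Unicode>sondere Schrift daraus zu machen. Locke scheint fort-</Unicode>
  </TextEquiv>
  <TextEquiv index="1">
    <Unicode>gondere Schrift daraus zu machen. LDocke scheint fort-</Unicode>
  </TextEquiv>
</TextLine>
"""


def lowest_index_text(segment):
    """Return the text of the TextEquiv with the lowest index (hypothetical helper)."""
    # "{*}TextEquiv" matches the tag in any namespace or in none (Python 3.8+).
    textequivs = segment.findall("{*}TextEquiv")
    if not textequivs:
        return ""

    def sort_key(textequiv):
        index = textequiv.get("index")
        # Assumption: a TextEquiv without an index sorts after all indexed ones.
        return float("inf") if index is None else int(index)

    # min() returns the first element with the smallest key, so ties keep document order.
    best = min(textequivs, key=sort_key)
    unicode_element = best.find("{*}Unicode")
    if unicode_element is None:
        return ""
    # strip() only removes the layout whitespace around the text in this example.
    return (unicode_element.text or "").strip()


line = ET.fromstring(SAMPLE)
print(lowest_index_text(line))
```

For the sample line this picks the corrected text in `index="0"`, independent of the order in which the `TextEquiv` elements appear in the file.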
from dinglehopper.
Thank you @mikegerber for the quick response.
> Do you see a problem with that I might be missing?

No, not at all. It would perfectly fit our needs. The only reason to keep the index selection option would be to compare the corrected output with the original one. A use case: if you use ABBYY for OLR and keep its OCR'd text, you can easily compare it with the newly recognized text.
from dinglehopper.
> The only reason to keep the index selection option would be to compare the corrected output with the original one. A use case: if you use ABBYY for OLR and keep its OCR'd text, you can easily compare it with the newly recognized text.

I'd suggest keeping the ABBYY results and the manually corrected files in separate file groups and comparing those, e.g.
ocrd-dinglehopper -I OCR-ABBYY,OCR-ABBYY-CORRECTED -O OCR-ABBYY-CORRECTED-DIFF -P metrics false
This seems to make it a lot more explicit.
from dinglehopper.
You are absolutely right, it is much more explicit. I suppose this is more of a fundamental question: if I have multiple versions (indexes) in my file, I might need to compare them with each other, or to compare a specific index against another file. But how often will that happen, and should dinglehopper offer an option for these few cases?
from dinglehopper.
There is - to my knowledge - nothing in the PAGE specs that says the index is anything more than a preference order; it just happens that LAREX seems to produce files where we could select by index. Another tool might just add indexes where something changed. So I'd recommend copying the files to a named file group.
As for getting the correct TextEquivs, I have fixed this today and will merge!
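For files where the index does happen to mark a version, as in the LAREX example, the copying step could be sketched as follows. This is a rough illustration under stated assumptions: `keep_only_index` is a hypothetical helper and not part of dinglehopper or OCR-D, the sample text is abridged from the line above, and writing the copies out and registering them as file groups in the workspace is not shown.

```python
import copy
import xml.etree.ElementTree as ET


def keep_only_index(root, index):
    """Return a copy of the tree that keeps only TextEquivs with the given index.

    Hypothetical helper; elements without a matching TextEquiv end up with no text.
    """
    result = copy.deepcopy(root)  # leave the original tree untouched
    wanted = str(index)
    # Take a snapshot of the elements first, so removals do not disturb the traversal.
    for parent in list(result.iter()):
        # "{*}TextEquiv" matches the tag in any namespace or in none (Python 3.8+).
        for textequiv in parent.findall("{*}TextEquiv"):
            if textequiv.get("index") != wanted:
                parent.remove(textequiv)
    return result


# Abridged, un-namespaced version of the example line.
LINE = ET.fromstring(
    '<TextLine id="l2">'
    '<TextEquiv index="0"><Unicode>Locke scheint</Unicode></TextEquiv>'
    '<TextEquiv index="1"><Unicode>LDocke scheint</Unicode></TextEquiv>'
    "</TextLine>"
)

for index in (0, 1):
    variant = keep_only_index(LINE, index)
    print(index, [u.text for u in variant.iter("Unicode")])
```

Each resulting tree then holds a single version per line and could be saved into its own named file group, so that a later comparison is explicit about which versions it compares.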
from dinglehopper.