Comments (5)
An interesting more recent addition in the scientific community is the Flexible character accuracy measure.
As I have a similar problem and need a solution, I will try to integrate the Flexible Character Accuracy as an option for dinglehopper.
from dinglehopper.
I'm leaning towards providing the UWER (unordered word error rate) in dinglehopper to resolve this.
Thoughts:
- I don't think a layout analysis feature (which is what reordering the paragraphs amounts to) would be appropriate here in an evaluation tool. If there's a simple algorithm that solves most issues, there should be a separate tool to do this in the OCR-D community.
- Just trying all permutations of paragraphs is IMHO no good, as this would be on the order of O(m!) for m paragraphs.
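To put a number on the O(m!) point above, here is a minimal sketch that only counts how many paragraph orderings a brute-force search would have to score. The function name `orderings` is made up for illustration and is not part of dinglehopper.

```python
import math


def orderings(m: int) -> int:
    """Number of distinct orderings of m paragraphs (hypothetical helper)."""
    return math.factorial(m)


# Each ordering would need its own full text comparison against the GT.
for m in (3, 5, 10, 15):
    print(f"{m} paragraphs: {orderings(m)} orderings")
```

Even a page with 10 paragraphs already means millions of candidate orderings, each needing a full edit-distance computation.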
from dinglehopper.
I must agree, calculation of reliable accuracy rates with wrong segmentation order is beyond the possibilities of dinglehopper. The sheer amount of possible segmentation classes/errors is escalating way too quickly!
As always when it comes to the topic of evaluation, the PRImA group have some good publications about this, e.g. "The Significance of Reading Order in Document Recognition and its Evaluation" and "Scenario Driven In-Depth Performance Evaluation of Document Layout Analysis Methods".
The typical solution for this adopted in other evaluation tools is to include the Bag-of-words (BOW) metric, which is easy to compute and could probably be supported by dinglehopper too.
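To illustrate why a bag-of-words comparison is insensitive to reading order, here is a rough sketch using only the Python standard library. The function names and the normalization (the larger of the missing and spurious counts, divided by the number of GT words) are assumptions made for this example; other tools may define the metric differently, and none of this is existing dinglehopper API.

```python
from collections import Counter


def bag_of_words_errors(gt: str, ocr: str) -> tuple[int, int]:
    """Return (missing, spurious) word counts, ignoring word order."""
    gt_bag = Counter(gt.split())
    ocr_bag = Counter(ocr.split())
    missing = sum((gt_bag - ocr_bag).values())   # words in GT but not in OCR
    spurious = sum((ocr_bag - gt_bag).values())  # words in OCR but not in GT
    return missing, spurious


def bow_error_rate(gt: str, ocr: str) -> float:
    """Order-independent word error rate (one possible normalization)."""
    missing, spurious = bag_of_words_errors(gt, ocr)
    n = len(gt.split())
    if n == 0:
        # Assumption: an empty GT counts as fully wrong if the OCR has any words.
        return 1.0 if spurious else 0.0
    return max(missing, spurious) / n
```

With this, two columns read in the wrong order score as error-free, e.g. `bow_error_rate("first column second column", "second column first column")` is 0, while real recognition errors still count.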
An interesting more recent addition in the scientific community is the Flexible character accuracy measure.
from dinglehopper.
Flew through this paper. It does compare strings of GT with substrings of OCR (in case of erroneously joined columns).
(I assume the "equal-length distance" editDist(..., substr(..., t2.length )) is because of runtime considerations, but in theory this does not need to be the same length; I would suggest word boundaries.)
I still think such a flexible comparison is essential to verify the workflows in use before running OCR-D in production.
from dinglehopper.
Flew through this paper. It does compare strings of GT with substrings of OCR (in case of erroneously joined columns).
Simplified explanation: FCA compares a line from GT with all lines from OCR and either finds a satisfying match or splits the GT line into smaller fragments based on the best match found. There are more steps, and some implementation details are only visible in the Java implementation.
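A very rough sketch of just the first step described above (finding the closest OCR line for one GT line) could look like this. It is not the FCA algorithm from the paper or the Java implementation: the splitting into fragments and all other steps are left out, and the names `levenshtein` and `best_match` are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def best_match(gt_line: str, ocr_lines: list[str]) -> tuple[int, int]:
    """Return (index, distance) of the OCR line closest to gt_line.

    Assumes ocr_lines is non-empty.
    """
    distances = [levenshtein(gt_line, ocr_line) for ocr_line in ocr_lines]
    index = min(range(len(distances)), key=distances.__getitem__)
    return index, distances[index]
```

Because each GT line is matched against every OCR line, the order in which the OCR lines appear does not affect which one is picked.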
(I assume the "equal-length distance"
editDist(..., substr(..., t2.length ))
is because of runtime considerations, but in theory this does not need to be the same length; I would suggest word boundaries.)
I am confused by your mention of the "equal-length distance"... maybe you are confusing it with the splitting of lines into smaller fragments?
from dinglehopper.