stscoundrel / old-danish-dictionary-builder
Build "Dictionary of the Old Danish Language" into easier-to-use data formats
License: MIT License
The final output contains 8 empty headwords. They do have definitions though.
See what's going on and whether they should just be part of the previous entry. That would be the easiest fix, should it make sense.
Some headwords are split into two rows in the book. They should be recognized and parsed back together.
For example, Axelvej ends up undetected: while it does start with a promising regex match, it ends in a dash instead of a comma. We'll have to add the dash variant to the regexes and do additional parsing for entries whose headword is partial.
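A minimal sketch of the dash variant, written in Python for illustration only (the project's actual language and regexes may differ). The pattern `HEADWORD_START` and the function `classify_headword` are hypothetical names, and the assumption is that a headword starts with a capital letter and is immediately followed by either a comma or a dash.

```python
import re

# Assumed headword shape: capital letter, lowercase letters, then "," or "-".
HEADWORD_START = re.compile(r"^\s*([A-ZÆØÅ][a-zæøå]+)([,-])")


def classify_headword(line):
    """Return (word, kind) where kind is "complete" or "partial", or None."""
    match = HEADWORD_START.match(line)
    if match is None:
        return None
    word, ending = match.groups()
    # A comma closes a full headword; a dash means it continues on the next row.
    return word, "complete" if ending == "," else "partial"
```

Lines classified as "partial" would then need the extra parsing step that joins them with the following row.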
Sometimes OCR fails to detect the column breaker. Instead, it is generally something like four spaces along the middle of the line.
If a line is missing the divider, see if middle contains such arrangement of spaces & use it instead.
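One way the fallback could look, as a rough Python sketch. The divider character `|`, the function name `split_columns`, and the choice of "run of four or more spaces closest to the middle" are all assumptions made for illustration, not the project's actual behavior.

```python
import re

DIVIDER = "|"  # assumed divider character; the real OCR output may differ


def split_columns(line, divider=DIVIDER):
    """Split a two-column line into (left, right)."""
    if divider in line:
        left, right = line.split(divider, 1)
        return left.rstrip(), right.strip()
    # Fallback: look for runs of 4+ spaces that are neither leading
    # indentation nor trailing padding.
    gaps = [
        m for m in re.finditer(r" {4,}", line)
        if m.start() > 0 and m.end() < len(line)
    ]
    if not gaps:
        return line.rstrip(), ""
    # Use the gap whose center is closest to the middle of the line.
    middle = len(line) / 2
    best = min(gaps, key=lambda m: abs((m.start() + m.end()) / 2 - middle))
    return line[:best.start()].rstrip(), line[best.end():].strip()
```

A line with no divider and no suitable gap is returned as a left-column-only line, which keeps the caller's logic simple but may need revisiting.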
Due to a mistake in the letterPages list, the downloader downloads letters K and Æ.
Headword Ymte, causes a false positive in the headword parsing that is meant to combine headwords broken into two lines. The headword ends up as Ymte,hafide.
Ensure only headwords that end in a dash get this combining.
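The guard could be as small as the following Python sketch. `combine_headword` is a hypothetical helper, and the split "Axel-" / "vej," in the usage note is an assumed example of how a broken headword might appear.

```python
def combine_headword(first, second):
    """Join a headword with its continuation only if it ends in a dash."""
    if first.endswith("-"):
        # Drop the dash and glue the continuation on.
        return first[:-1] + second
    # Complete headwords (ending in a comma) are left untouched.
    return first
```

With this guard, `combine_headword("Axel-", "vej,")` gives `"Axelvej,"`, while `combine_headword("Ymte,", "hafide")` stays `"Ymte,"`.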
Some headwords are incorrectly OCR'd. While some text/definition content probably is too, the issue is much more glaring in headwords. Examples:
While it is hard or impossible to detect all of these, we could add some "proofreader" mapping for headwords, which could be updated as we face these.
Should we want some automatic detection, we could check if alphabetic order for entries matches. For example, one page has:
"Azelkøbstad",
"Axelskav",
"Axeltorg",
"Axelseng",
"Axeltand",
"Azelvej",
"Axel",
"Axelmærke",
"Axeniere",
This easily reveals which ones are incorrectly read. But then again, they could just as easily be misread in a way that is still alphabetically valid, especially if the given letter only has a few entries.
If a page starts in middle of definition, recognize that. Currently naively expects pages to start with an entry & treats partial entry as such.
The current PAGES_TO_IRREGULAR_META_LINE_INDEXES mapping contains notes about some pages whose meta lines seemed irregular or might break parsing. Add test cases for them. Some of the TODOs are duplicates, so one might want to first go through the listed files and see the variations.
Probably needs something like:
For example, page 2387-røtte (rotte).txt has two undetected headwords. Both headwords are "Røve", but they are OCR'd as "Bøve". There is no easy way to regex our way out of that.
We could add known typos and their replacements and apply them to the data prior to entry detection, or something to similar effect.
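A sketch of such a replacement pass in Python. The map `KNOWN_TYPOS` and the function `apply_known_typos` are hypothetical; the only pair taken from the issue is "Bøve" to "Røve". Restricting the replacement to the start of a line is an assumption, chosen so that text inside definitions is left alone.

```python
# Known OCR misreads of headwords, extended as new ones are found.
KNOWN_TYPOS = {
    "Bøve": "Røve",
}


def apply_known_typos(line, typos=KNOWN_TYPOS):
    """Fix a known misread headword at the start of a line."""
    for wrong, right in typos.items():
        if line.startswith(wrong):
            return right + line[len(wrong):]
    return line
```

Because a global map could also hit legitimate words on other pages, keying it by page file name might be safer in practice.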
Some entries have multiple definitions like 1) foo bar 2) bar baz etc.
A few of those are incorrectly read when a definition contains something that looks like a headword. Let's run a script that outputs entries that look suspicious in that regard. If problematic ones are found (that is, ones that should not be their own entries), add them to the false positives map.
There are some OCR errors in meta lines, which means the letters on a page may be misread. Example: 71-arbejdelse.txt has only entries starting with "A". After OCR, it appears to have both "A" and "Å".
That could be remedied with a sanity check, like "can Å be the next letter after A". But then again, "A" could also be misread.
How about:
Example: AzelKøbstad should be Azelkøbstad. As it is split over two lines, the OCR seems to have deemed the letter k a capital.
It is unlikely that there are many actual uppercase letters within headwords. We could just lowercase everything after the first letter, unless more edge cases appear.
Note: the actual word would be Axelkøbstad, but x/z difference comes from OCR. Hard to ensure they're all 100% correct with scanned pages.
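The lowercasing idea, sketched in Python with a hypothetical helper name. It keeps the first character as-is and lowercases the rest, so it does not attempt to fix the x/z misread mentioned in the note.

```python
def normalize_headword_case(headword):
    """Lowercase everything after the first character of a headword."""
    if not headword:
        return headword
    return headword[0] + headword[1:].lower()
```

For the example above, `normalize_headword_case("AzelKøbstad")` gives `"Azelkøbstad"`.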
If one outputs a crude alphabet in a set based on the starting letters of all entries, the alphabetical order will be incorrect.
'a',
'b',
'd',
'e',
'i',
'f',
'g',
'h',
's',
'j',
'k',
'l',
'm',
'n',
'o',
'y',
'p',
'r',
't',
'u',
'v',
'x',
'æ',
'ø',
See i, which seems to come after e. That is a hint that some entry has an incorrectly OCR'd starting letter. Try to detect these via a script and apply exceptions.
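A possible detection script, as a Python sketch. It assumes lowercase Danish letters in the order a to z followed by æ, ø, å, and uses a simple heuristic of my own choosing: a letter is suspicious when its two neighbors are in order with each other but the letter does not fit between them. The names `DANISH_ALPHABET` and `find_misplaced_letters` are hypothetical.

```python
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
ORDER = {letter: index for index, letter in enumerate(DANISH_ALPHABET)}


def find_misplaced_letters(letters):
    """Return letters that break the order between consistent neighbors."""
    ranks = [ORDER[letter] for letter in letters]
    misplaced = []
    # First and last letters have only one neighbor, so they are skipped.
    for i in range(1, len(ranks) - 1):
        prev, cur, nxt = ranks[i - 1], ranks[i], ranks[i + 1]
        if prev <= nxt and not (prev <= cur <= nxt):
            misplaced.append(letters[i])
    return misplaced
```

Run on the listing above, this flags `i`, `s` and `y`. It will not catch a misread that happens to stay in order, and it raises `KeyError` for characters outside the assumed alphabet.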
Currently drops entries that are not valid after they're supposed to be combined. There's bound to be odd edge cases, do something smart about them.
2065-opgulpe.txt has a highly irregular meta line. For these cases, we could simply have a custom map for exceptions.
Current approach of naively detecting "is this first word in line? does it end in comma?" results in some false positives. For example:
Page: 71-arbejdelse.txt
First entry: dég, is considered an entry and even capitalized.
Should have more exact detection. Perhaps pass expected starting letter to entry parsing.
Currently we're missing an entity that would combine entries from individual pages. Almost every single page will have an entry whose definition will overflow to the next page.
We'll have to loop through all the pages at some point anyway, so we might as well add a Dictionary (or similar) class that can juggle individual pages and handle the bigger picture of entries.
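A rough Python sketch of what such a class might look like. `Entry`, `Page` and `Dictionary` here are hypothetical shapes, not the project's actual types; in particular, the sketch assumes each parsed page can report `leading_text`, the text that appears before its first headword.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    headword: str
    definition: str


@dataclass
class Page:
    entries: List[Entry]
    leading_text: str = ""  # overflow from the previous page, if any


class Dictionary:
    """Collects entries across pages and stitches overflowing definitions."""

    def __init__(self):
        self.entries: List[Entry] = []

    def add_page(self, page):
        # Text before the first headword belongs to the last entry so far.
        if page.leading_text and self.entries:
            last = self.entries[-1]
            last.definition = f"{last.definition} {page.leading_text}".strip()
        self.entries.extend(page.entries)
```

Pages must be added in book order for the overflow to land on the right entry.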
"Røttenest" in page 2387-røtte (rotte).txt
is undetected, even though the OCR should be clear enough for a regex match. See what's wrong.
Pages like 97-balstyrig.txt are incorrectly read. The original scan is a bit skewed, which probably means the gif should be rotated before OCR.
Try to figure out how many of these there are and how they should be handled.
For example, page 962-gørrel.txt has two entries:
They look like headwords starting with letter H, which is completely valid for that page. Coincidentally they are also in alphabetical order, so all seems good. Both are still misread, as they should be part of the previous entry, "Had".
There may be no easy way to tell these apart automatically. Let's add a mapping or something similar where we can list such exceptions so that they'll be parsed into previous entries.
Current letter handling based on meta line has gotten quite complex and soon has more edge cases than normal logic.
See if alternative way would be better:
Currently some empty or nearly empty lines may be passed to entry parsing. We should add a more sophisticated check for whether a line only contains spaces, linebreaks and other fluff.
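A minimal Python sketch of such a check. It treats any line without a single letter or digit as empty, which is an assumption about what counts as "fluff"; the function name `is_emptyish` is hypothetical.

```python
def is_emptyish(line):
    """True if the line has no letters or digits at all."""
    # Whitespace, linebreaks and stray punctuation all fail isalnum().
    return not any(ch.isalnum() for ch in line)
```

Lines for which this returns `True` would be skipped before entry parsing.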
Current implementation keeps all whitespaces and oddities.
See Axeltand,\n for an example. The trailing whitespace should be trimmed away.
This applies at least to the very first page. Either add handling, an exception, or a modification to the page itself.
There are some headwords that can not be reliably recognized from OCR output. If OCR can not be improved in this regard, think of another way.
For example, the Afklappe - Afkynde page has:
Which are not recognized by the current implementation. Let's gather problematic entries in this issue and see what can be done about them.
Some headwords might have a linebreak in them. However, it looks like these are not actual main headwords, but additional info like Afkontrajefe, which is part of Afkontrafej.
The current implementation counts it as a headword. Add a way to detect these and append them back to the previous entry.
Currently Abeganterino.narreverk is read as an entry, as in the line it is Abeganterino.narre-. We should probably disallow periods in the entry regex.
The very first entry in the dictionary "Abbot" is OCR'd as "Åbbot". This could be fixed in the typolist, but then again such simple cases could be remedied with the "expected starting letter". If the page only has entries for letter A, it could simply force letter A for all headwords that do not already have it. As long as we take note of partial entries.
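The "expected starting letter" idea could be sketched like this in Python. `force_starting_letter` and its `is_continuation` flag are hypothetical; the flag stands in for whatever mechanism marks the second half of a headword that was split over two rows.

```python
def force_starting_letter(headword, expected, is_continuation=False):
    """Replace a misread first letter with the page's expected letter."""
    # Continuations of split headwords do not start with the page letter.
    if is_continuation or not headword:
        return headword
    if headword[0].upper() == expected.upper():
        return headword
    return expected.upper() + headword[1:]
```

For example, `force_starting_letter("Åbbot", "A")` gives `"Abbot"`. This only makes sense on pages known to contain a single letter, and it would also rewrite false-positive headwords, so it should probably run after those are filtered out.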
Current implementation trims whitespace. This makes it tricky to detect some headwords, as new headword may start with just indentation / whitespace at the start of the row.
Alter settings & output whitespaced version
It seems like headwords always have a comma after them. The current implementation allows other endings too. The comma could be used as a slightly less naive check of whether we're dealing with an actual headword. If not, the whole thing should probably be appended to the previous entry.
Example: Bable
Currently part of Bable's description is recognized as a headword, called Babel. It is quite a tricky false positive to recognize from the headword alone.
In this case, Bable has numbered definitions, like 1) and 2). The second one ends up in Babel. Should an entry have a definition that starts from something other than the first definition, just append it back to the previous entry.
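A Python sketch of that rule, using plain `(headword, definition)` tuples as a stand-in for the real entry type. The function name, the `N)` numbering pattern, and the choice to put the false headword's text back in front of its definition are assumptions for illustration.

```python
import re

# A definition that opens with "2)", "3)", ... continues an earlier entry.
LEADING_NUMBER = re.compile(r"^\s*(\d+)\)")


def merge_continued_definitions(entries):
    """Append entries whose definition starts past 1) to the previous entry."""
    merged = []
    for headword, definition in entries:
        match = LEADING_NUMBER.match(definition)
        if match and int(match.group(1)) > 1 and merged:
            prev_headword, prev_definition = merged[-1]
            merged[-1] = (
                prev_headword,
                f"{prev_definition} {headword} {definition}",
            )
        else:
            merged.append((headword, definition))
    return merged
```

Entries without numbering, or starting at 1), pass through unchanged.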
In the "V to Y" transition page, an entry — Ya(eyfærdig, is undetected. As it looks like a clear match, there is probably an issue in the regex. Check what is going on.
Relating to #37:
Definitions may end in a "dash" that implies a new entry starts. For example:
"no. svælg. Moth. Smlgn.1.surgel. —"
"go. skylle hals. en. Moth. —"
Let's trim those away from the ends of definitions.
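A small Python sketch of the trimming. It only removes a trailing em dash (U+2014, as in the two examples) together with the whitespace around it; whether other dash characters also occur in the OCR output is not known here. `trim_trailing_dash` is a hypothetical name.

```python
import re

# Trailing em dash (U+2014) with optional surrounding whitespace.
TRAILING_DASH = re.compile("\\s*\u2014\\s*$")


def trim_trailing_dash(definition):
    """Remove a dangling entry-separator dash from the end of a definition."""
    return TRAILING_DASH.sub("", definition)
```

Applied to the second example, the result ends in `"en. Moth."` with the dash gone.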
On pages where one letter ends and another begins, an unorthodox column layout is used. Between letters there is a divide, so the first letter will always be completely on the top part of the page (in two columns) while the second letter will be on the bottom part of the page (in two columns).
This means that we can't just "break text by columns, glue back together" for those pages. For those pages we should detect that we're dealing with a two-letter page and do something else. This may be tricky, as those "letter heading breaks" do not seem to appear in the OCR'd output.
If it can't be detected otherwise, we could keep a hardcoded mapping of these "two-letter" pages, as at worst there can be 20+ of them. Then we could split them differently based on whatever criteria.
Page 962-gørrel.txt has a combination of previous meta line issues. It won't drop a singular "4", as there is the expected amount of meta parts. This is due to two words appearing as a combination word, as OCR read them together.
There is handling for separating those lines, but it is guarded by a clause of having too few parts. It should be adjusted so it drops this extra "4".
See if the OCR of 2387-røtte (rotte).txt can be improved with rotation, as the scan is skewed.
It takes a while to generate all text files from images. While that is not too bad when done once, the quality of the data means it may need to be done multiple times until we get it all right.
Just add a few workers and process the images in larger batches.
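A sketch of the worker idea using Python's standard `concurrent.futures`. The `ocr` argument is a placeholder for whatever function turns one image path into text, and `process_in_batches` is a hypothetical name. Threads are assumed to be enough because OCR tools are typically run as external processes; a process pool may suit better if the OCR runs in-process.

```python
from concurrent.futures import ThreadPoolExecutor


def process_in_batches(paths, ocr, workers=4, batch_size=50):
    """Run ocr(path) for every path, a batch at a time, on several workers."""
    paths = list(paths)
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(paths), batch_size):
            batch = paths[start:start + batch_size]
            # pool.map preserves input order, so zip pairs paths with output.
            for path, text in zip(batch, pool.map(ocr, batch)):
                results[path] = text
    return results
```

Batching keeps memory bounded and gives a natural point to write results to disk between batches.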