old-danish-dictionary-builder's People

Contributors

dependabot[bot], stscoundrel

old-danish-dictionary-builder's Issues

Empty headwords

The final output contains 8 empty headwords, although they do have definitions.

See what's going on and whether they should simply be part of the previous entry. That would be the easiest fix, if it makes sense.

Headwords: actual linebreak detection & fixing

Some headwords are split across two rows in the book. They should be recognized and parsed back together.

For example, Axelvej ends up undetected: its first row starts in a way the regex would accept, but it ends in a dash instead of a comma. We'll have to add the dash-ending variant to the regexes and do additional parsing for entries whose headword is partial.

Column detection: include odd spacing close to middle.

Sometimes OCR fails to detect the column breaker. Instead, the line generally has something like four spaces around its middle.

If a line is missing the divider, check whether its middle contains such a run of spaces and use that as the column break instead.
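A minimal sketch of that fallback, assuming the divider is a `|` character and that a "middle" gap means a run of spaces whose centre lies within a quarter of the line length from the midpoint. The function name, the divider character and both thresholds are illustrative assumptions, not the project's actual code.

```python
import re


def split_columns(line: str, divider: str = "|", min_gap: int = 4) -> tuple[str, str]:
    """Split one OCR'd line into (left column, right column)."""
    if divider in line:
        left, right = line.split(divider, 1)
        return left.strip(), right.strip()

    # Fallback: look for runs of spaces and keep those close to the middle.
    middle = len(line) / 2
    tolerance = len(line) / 4  # assumed threshold, tune against real pages
    gaps = [
        gap
        for gap in re.finditer(" {%d,}" % min_gap, line)
        if abs((gap.start() + gap.end()) / 2 - middle) <= tolerance
    ]
    if not gaps:
        # No usable break: treat the whole line as left-column text.
        return line.strip(), ""

    best = min(gaps, key=lambda gap: abs((gap.start() + gap.end()) / 2 - middle))
    return line[: best.start()].strip(), line[best.end():].strip()
```

The tolerance keeps plain indentation at the start of a row from being mistaken for a column break.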

Scraper: downloads letters K and Æ twice.

Due to a mistake in the letterPages list, the downloader downloads letters K and Æ twice.

  • Drop duplicates
  • See if images should be redownloaded. The ordering would change, and the parser contains references to the current numbering. For repeatability it should always produce the same results, but that might mean running all images through OCR again, which could take hours. Therefore, see if the numbering can be remedied some other way.
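Dropping the duplicates while keeping the download order could look roughly like this. The list contents are made up for illustration; only the name letterPages comes from the issue, and its real shape may differ.

```python
def drop_duplicates(letter_pages: list[str]) -> list[str]:
    """Remove repeated entries while preserving first-seen order."""
    # dict keys are unique and keep insertion order, so this dedupes stably.
    return list(dict.fromkeys(letter_pages))


# Hypothetical excerpt of a letterPages-style list with K and Æ repeated.
letter_pages = ["J", "K", "K", "L", "Æ", "Æ", "Ø"]
unique_pages = drop_duplicates(letter_pages)
```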

Headword linebreak parsing: false positive

The headword Ymte, causes a false positive in the headword parsing that is meant to combine headwords broken across two lines. The headword ends up as Ymte,hafide.

Ensure only headwords that end in a dash get combined.
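A sketch of that guard, assuming the two halves arrive as separate tokens and that the continuation carries the closing comma. The function name and the exact split point of Axelvej are assumptions for illustration.

```python
from typing import Optional


def join_split_headword(first_part: str, continuation: str) -> Optional[str]:
    """Join a headword broken across two lines.

    Returns None when the first part does not end in a dash, so that
    complete headwords such as "Ymte," are never combined.
    """
    if not first_part.endswith("-"):
        return None
    # Drop the line-break dash and the comma that closes the headword.
    return first_part[:-1] + continuation.rstrip(",")
```

With this check, `join_split_headword("Ymte,", "hafide")` returns None, so the caller can keep "Ymte," as a complete headword.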

"Proofread" incorrect OCR headwords

Some headwords are incorrectly OCR'd. While some text/definition content probably is too, the issue is much more glaring in headwords. Examples:

  • Azelkøbstad -> should be Axelkøbstad
  • Azelvej -> should be Axelvej

While it is hard or impossible to detect all of these, we could add some "proofreader" mapping for headwords, which could be updated as we face these.
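Such a mapping could be as small as a dictionary lookup. The two corrections below are the ones listed in this issue; the constant and function names are assumptions.

```python
# Known OCR misreadings, extended by hand as new ones are found.
HEADWORD_CORRECTIONS = {
    "Azelkøbstad": "Axelkøbstad",
    "Azelvej": "Axelvej",
}


def proofread_headword(headword: str) -> str:
    """Return the corrected headword, or the input if no correction is known."""
    return HEADWORD_CORRECTIONS.get(headword, headword)
```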

Should we want some automatic detection, we could check if alphabetic order for entries matches. For example, one page has:

        "Azelkøbstad",
        "Axelskav",
        "Axeltorg",
        "Axelseng",
        "Axeltand",
        "Azelvej",
        "Axel",
        "Axelmærke",
        "Axeniere",

This easily reveals which ones are incorrectly read. But then again, they could just as easily be misread in a way that is still alphabetically valid, especially if a given letter only has a few entries.

Add handling for irregular meta lines

The current PAGES_TO_IRREGULAR_META_LINE_INDEXES mapping contains notes about some pages whose meta lines seemed irregular or might break parsing. Add test cases for them. Some of the TODOs are duplicates, so one might want to first go through the listed files and see the variations.

Probably needs something like:

  • Cleaning up additional "parts" which are only one letter.
  • Logic for when there are still numbers at both ends.
  • Other cleanup and/or exceptions per page name.

Produce a list of entries whose definitions start with number higher than 1

Some entries have multiple definitions like 1) foo bar 2) bar baz etc.

A few of those are incorrectly read when a definition contains something that looks like a headword. Let's run a script that outputs entries that look suspicious in that regard. If problematic ones are found (i.e. they should not be their own entries), add them to the false positives map.
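A rough version of such a script, assuming entries are available as (headword, definition) pairs and that numbered definitions are written as `1)`, `2)` and so on. The data shape and names are assumptions; the sample definitions are placeholders.

```python
import re

# A definition that opens with a number followed by ")".
LEADING_NUMBER = re.compile(r"^\s*(\d+)\)")


def suspicious_entries(entries: list[tuple[str, str]]) -> list[str]:
    """Return headwords whose definition starts with a number above 1."""
    flagged = []
    for headword, definition in entries:
        match = LEADING_NUMBER.match(definition)
        if match and int(match.group(1)) > 1:
            flagged.append(headword)
    return flagged


sample = [("Bable", "1) foo bar"), ("Babel", "2) bar baz"), ("Bad", "plain text")]
```

The output is only a list of candidates to review by hand before anything goes into the false positives map.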

Letters based on meta line: errors in OCR

There are some OCR errors in meta lines, which means letters in page may be misread. Example:

71-arbejdelse.txt has only entries starting with "A". After OCR, it thinks it has "A" and "Å".

That could be remedied with sanity check, like "can Å be the next letter after A". But then again, "A" could also be misread.

How about:

  • Read the preliminary letter from the filename. It should be available, or at least easily injectable.
  • Treat that first letter as canonical. Compare the second letter to it and alter it if needed.
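Reading the letter from the filename might look like the sketch below. It assumes filenames follow the `<page number>-<first headword>.txt` shape seen in this tracker (for example 71-arbejdelse.txt); the function name is hypothetical, and reconciling the second letter is left out because the rule for it is not settled here.

```python
import re

# Page number, a dash, then the first letter of the page's first headword.
# [^\W\d_] matches any Unicode letter, including æ, ø and å.
FILENAME_LETTER = re.compile(r"^\d+-([^\W\d_])")


def letter_from_filename(filename: str) -> str:
    """Return the uppercase starting letter encoded in a page filename."""
    match = FILENAME_LETTER.match(filename)
    if match is None:
        raise ValueError(f"Unexpected page filename: {filename!r}")
    return match.group(1).upper()
```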

Headwords: combined two-line headwords may have incorrect casing

Example: AzelKøbstad should be Azelkøbstad. As it is in two lines, the OCR seemed to deem the letter k as capital.

It is unlikely that there are too many actual uppercase letters within headwords. We could just lowercase it, unless more edge cases appear.
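Lowercasing everything after the first character is a one-liner; the function name is an assumption.

```python
def normalize_headword_casing(headword: str) -> str:
    """Keep the first character as-is and lowercase the rest."""
    return headword[:1] + headword[1:].lower()
```

For the example above, `normalize_headword_casing("AzelKøbstad")` gives "Azelkøbstad".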

Note: the actual word would be Axelkøbstad, but x/z difference comes from OCR. Hard to ensure they're all 100% correct with scanned pages.

Detect incorrect start letters

If one collects the starting letters of all entries, in order of first appearance, into a crude alphabet, the alphabetical order turns out to be incorrect.

    'a',
    'b',
    'd',
    'e',
    'i',
    'f',
    'g',
    'h',
    's',
    'j',
    'k',
    'l',
    'm',
    'n',
    'o',
    'y',
    'p',
    'r',
    't',
    'u',
    'v',
    'x',
    'æ',
    'ø',

Note i, which appears after e. That hints that some entry has an incorrectly OCR'd starting letter. Try to detect these via a script and apply exceptions.
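One possible detection script, assuming the starting letters are collected in order of first appearance and compared against the Danish alphabet order (a to z, then æ, ø, å). A letter is flagged when the letter that follows it sorts earlier. Names are illustrative.

```python
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"


def out_of_order_letters(letters: list[str]) -> list[str]:
    """Return letters that are followed by an alphabetically earlier letter."""
    rank = {letter: index for index, letter in enumerate(DANISH_ALPHABET)}
    suspicious = []
    for current, following in zip(letters, letters[1:]):
        if rank[current] > rank[following]:
            suspicious.append(current)
    return suspicious


# The crude alphabet listed in this issue.
observed = list("abdeifghsjklmnoyprtuvx") + ["æ", "ø"]
```

For the list above this flags i, s and y, which are the letters to look up in the entries.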

Improved entry detection: don't rely on first word & comma alone

Current approach of naively detecting "is this first word in line? does it end in comma?" results in some false positives. For example:

Page: 71-arbejdelse.txt
First entry: dég, is considered an entry and even capitalized.

Should have more exact detection. Perhaps pass expected starting letter to entry parsing.

Higher level Dictionary class (or similar) that oversees pages

Currently we're missing an entity that would combine entries from individual pages. Almost every single page will have an entry whose definition overflows to the next page.

We'll have to loop through all the pages at some point anyway, so we might as well add a Dictionary (or similar) class that can juggle individual pages and handle the bigger picture of entries.

Entry "Røttenest" is undetected

"Røttenest" in page 2387-røtte (rotte).txt is undetected, even though the OCR should be clear enough for a regex match. See what's wrong.

Some OCR pages are completely incorrect

Pages like 97-balstyrig.txt are incorrectly read. The original scan is a bit skewed, which probably means the gif should be rotated before OCR.

Try to figure out how many of these there are and how they should be handled.

Detected headwords that are false positives

For example, page 962-gørrel.txt has two entries:

  • Hesiodis
  • Højsgaard

They look like headwords starting with the letter H, which is completely valid for that page. Coincidentally they are also in alphabetical order, so all seems good. Both are still misread, as they should be part of the previous entry, "Had".

There may be no easy way to tell these apart automatically. Let's add a mapping or something similar where we can list such exceptions so that they'll be parsed into previous entries.
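A sketch of such a mapping, keyed by page name, together with a pass that folds listed headwords back into the preceding entry. The entry shape of (headword, definition) tuples, the names, and the way the text is rejoined are all assumptions.

```python
# Page name -> headwords that are really part of the previous entry.
FALSE_POSITIVE_HEADWORDS = {
    "962-gørrel.txt": {"Hesiodis", "Højsgaard"},
}


def merge_false_positives(
    page_name: str, entries: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Append known false-positive entries to the entry before them."""
    false_positives = FALSE_POSITIVE_HEADWORDS.get(page_name, set())
    merged: list[tuple[str, str]] = []
    for headword, definition in entries:
        if headword in false_positives and merged:
            previous_headword, previous_definition = merged[-1]
            # Restore the text roughly as it read before it was split off.
            restored = f"{previous_definition} {headword}, {definition}"
            merged[-1] = (previous_headword, restored)
        else:
            merged.append((headword, definition))
    return merged
```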

Letter/meta handling by filename

The current letter handling based on the meta line has gotten quite complex and will soon have more edge cases than normal logic.

See if alternative way would be better:

  • Deduce letter by filename
  • Split pages are already mapped -> they should be the exception, that has two letters.
  • Drop most of the meta parsing: Use it as just meta information.

Better "empty" detection prior to entry parsing

Currently some empty or nearly empty lines may be passed to entry parsing. We should add slightly more sophisticated checks for whether a line contains only spaces, linebreaks and other fluff.
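One simple reading of "fluff" is any line without a single letter or digit. That definition is an assumption, as is the function name.

```python
def is_effectively_empty(line: str) -> bool:
    """True if the line has no letters or digits at all."""
    return not any(character.isalnum() for character in line)
```

Lines such as `"   \n"` or `" . - "` are then skipped before entry parsing, while `"Abbot, en."` is kept.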

Entry parsing: definitions formatting

The current implementation keeps all whitespace and oddities.

  • Trim the content
  • Detect multiple definitions: if they are numbered, detect and split them. Use a list of strings in any case.
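Both steps could be sketched as follows, assuming numbered definitions use markers like `1)` and `2)`. The markers themselves are dropped, and any text before the first marker becomes its own item; both choices are assumptions.

```python
import re

# A definition marker such as "1)" or "12)", plus the spaces after it.
NUMBER_MARKER = re.compile(r"\b\d+\)\s*")


def split_definitions(raw: str) -> list[str]:
    """Trim the raw text and split it into individual definitions."""
    parts = NUMBER_MARKER.split(raw.strip())
    return [part.strip() for part in parts if part.strip()]
```

An entry without numbering still comes back as a list with one string, so callers never need to handle two types.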

Unrecognized headwords: extra handling

There are some headwords that cannot be reliably recognized from the OCR output. If OCR cannot be improved in this regard, think of another way.

For example, the Afklappe - Afkynde page has:

  • Afklappe
  • Afklare
  • Afkom

These are not recognized by the current implementation. Let's gather problematic entries in this issue and see what can be done about them.

Headwords with linebreak -> probably not headwords

Some headwords might have a linebreak in them. However, it looks like these are not actual main headwords, but additional info like Afkontrajefe, which is part of Afkontrafej.

The current implementation counts it as a headword. Add a way to detect these and append them back to the previous entry.

Headword regex: disallow periods

Currently Abeganterino.narreverk is read as an entry, because on its line it appears as Abeganterino.narre-. We should probably disallow periods in the entry regex.
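The project's actual regex is not shown here, so the pattern below is only a hypothetical stand-in: a token counts as a headword candidate when it consists solely of letters and ends in a comma or a dash. Anything containing a period fails the match.

```python
import re

# Letters only (any Unicode letter), closed by a comma or a line-break dash.
HEADWORD_PATTERN = re.compile(r"^[^\W\d_]+[,-]$")


def looks_like_headword(token: str) -> bool:
    """True if the token is letters only and ends in ',' or '-'."""
    return HEADWORD_PATTERN.match(token) is not None
```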

Headwords: OCR first letter issues

The very first entry in the dictionary "Abbot" is OCR'd as "Åbbot". This could be fixed in the typolist, but then again such simple cases could be remedied with the "expected starting letter". If the page only has entries for letter A, it could simply force letter A for all headwords that do not already have it. As long as we take note of partial entries.
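Forcing the expected letter could be sketched like this. The function name is an assumption, and the sketch presumes it is only called on lines already confirmed to be full headwords, since partial entries and continuation lines would be corrupted by it.

```python
def force_start_letter(headword: str, expected_letter: str) -> str:
    """Replace the first character if it is not the page's expected letter."""
    if not headword or headword[0].upper() == expected_letter.upper():
        return headword
    return expected_letter.upper() + headword[1:]
```

On a page that only holds entries for A, `force_start_letter("Åbbot", "A")` gives "Abbot".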

Image to text: include whitespace

The current implementation trims whitespace. This makes it tricky to detect some headwords, as a new headword may be marked only by indentation or whitespace at the start of the row.

Alter the settings and output a version that preserves whitespace.

Entry parsing: better headword formatting

It seems like headwords always have a comma after them. The current implementation allows other entries too. The comma could be used as a slightly less naive check of whether we're dealing with an actual headword or not. If not, the whole thing should probably be appended to the previous entry.

If the content's first definition number is higher than 1, append to previous entry

Example: Bable

Currently part of Bable's description is recognized as a headword, called Babel. It is quite a tricky false positive to recognize from the headword alone.

In this case, Bable has numbered definitions, like 1) and 2). The second one ends up in Babel. If an entry's definition starts from something other than the first definition, just append it back to the previous entry.

Multiple definitions

  • Change "definitions" to a list of str instead of str.
  • If entry contains multiple numbered definitions, break them into individual definitions

Second round of OCR and page number cleanup

Relating to #37:

  • Backup previously OCR'd images.
  • OCR newly ordered images again.
  • Check that OCR produced similar enough results again, i.e. are the typos the same etc.
  • If all is good, adjust hardcoded references to page numbers in the parser to match the new ones. A few hundred pages were removed in #37 as they were duplicates.

Definitions that end in dashes

Definitions may end in a "dash" that signals that a new entry starts. For example:

  • "no. svælg. Moth. Smlgn.1.surgel. —"
  • "go. skylle hals. en. Moth. —"

Let's trim those away from the ends of definitions.
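A possible trim step, assuming the marker is a long dash like the one in the examples above. Plain hyphens are deliberately left alone here in case they are line-break hyphenation; that choice and the names are assumptions.

```python
import re

# One or more long dashes (em or en) at the very end, with surrounding spaces.
TRAILING_DASH = re.compile(r"\s*[—–]+\s*$")


def trim_trailing_dash(definition: str) -> str:
    """Remove an entry-separator dash from the end of a definition."""
    return TRAILING_DASH.sub("", definition)
```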

Two letter pages: incorrect column parsing

On pages where one letter ends and another begins, an unorthodox column layout is used. Between the letters there is a divide, so the first letter will always be completely on the top part of the page (in two columns) while the second letter will be on the bottom part of the page (in two columns).

This means that we can't just "break text by columns, glue back together" for those pages. Instead we should detect that we're dealing with a two-letter page and do something else. This may be tricky, as those "letter heading breaks" do not seem to appear in the OCR'd output.

If it can't be detected otherwise, we could keep a hardcoded mapping of these "two letter" pages, as at worst there can be 20+ of them. Then we could split them differently based on whatever criteria.

Meta line issue in 962-gørrel.txt

Page 962-gørrel.txt has a combination of previous meta line issues. It won't drop the lone "4", because the expected number of meta parts is present. That happens because OCR read two words together as one combined word.

There is handling for separating those, but it is guarded by a clause that checks for too few parts. It should be adjusted so that it drops this extra "4".

Skewed pages

See if 2387-røtte (rotte).txt OCR can be improved with rotation, as the scan is skewed.

  • Create test case to see what entries are currently read
  • See how rotation affects them.

OCR: use multiple workers

It takes a while to generate all text files from images. While not too bad when doing it once, the quality of the data means it may need to be done multiple times until we get it all right.

Just add a few workers and process the images in larger batches.
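A minimal sketch with the standard library's thread pool. The OCR call itself is passed in as `ocr_image`, a hypothetical callable that takes an image path and returns text, since the actual OCR setup is not described here. The worker count is an arbitrary default.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def ocr_all(
    image_paths: Iterable[str],
    ocr_image: Callable[[str], str],
    max_workers: int = 4,
) -> dict[str, str]:
    """Run ocr_image over all paths concurrently, keyed by path."""
    paths = list(image_paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input paths.
        texts = pool.map(ocr_image, paths)
        return dict(zip(paths, texts))
```

Threads help when the OCR work happens in an external process or otherwise releases the interpreter lock. If it is pure Python CPU work, `ProcessPoolExecutor` offers the same `map` interface and may be the better fit.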
