langdata's People

Contributors

alonehoney, aslamy, atuyosi, indiclinguist, jimregan, mymonoo, nickjwhite, pa-hobe, ryanfb, shreeshrii, stweil, theraysmith, vigneshv59, wincentbalin, zdenop

langdata's Issues

Add U+02BC to Devanagari.unicharset

Some languages of India use U+02BC (’, MODIFIER LETTER APOSTROPHE), either as a tone mark or as a length mark, in their texts written in the Devanagari script.

e.g. ख’ल्ल
ित’लकना
दख’ना
खर’
कत’ पड़ा’ गेल’?

kan.unicharambigs seems to be a copy of eng.unicharambigs

Reviving an old issue - see RaghavBhardwaj/tesseract-ocr#801

I am shocked to notice that kan.unicharambigs contains only English-script entries instead of Kannada script. I had expected kan.unicharambigs to contain Kannada script only.

I suggest rebuilding https://github.com/tesseract-ocr/langdata/blob/master/kan/kan.unicharambigs based on https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-801/comment-6/kan.DangAmbigs.txt, with a review by someone who knows Kannada.
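For reference, a unicharambigs file lists substitution rules; below is a hypothetical sketch of two v1-format entries (Latin examples only, since any real Kannada entries would need to be written by a Kannada reader):

```
v1
2	' '	1	"	1
1	m	2	r n	0
```

Each line gives the token count and tokens of the wrong reading, then the token count and tokens of the correct reading, then a type flag (1 = mandatory substitution, 0 = optional).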

German issues

Neither deu/desired_characters nor deu/deu.training_text nor deu/deu.wordlist includes the paragraph character (§). That character is very common, especially in German legal texts, also in the form §§ (the plural, meaning paragraphs). Instead of the paragraph character, Tesseract typically detects a dollar sign (among other confusions). The paragraph character is missing from other languages, too.

Also missing is the long s character (ſ), which was used in texts from the 18th century, not only in German but also in Latin, French and Spanish.

For Tesseract 4, Fraktur typefaces are currently unsupported. They were in common use until 1945 and rarely used after that.
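A minimal check for the missing character (a sketch; the file paths in the report, such as deu/deu.wordlist, are what you would pass in):

```python
# Report whether a given character (default: § U+00A7) occurs in a
# langdata text file such as deu/deu.wordlist.
def contains_char(path, char="\u00a7"):
    with open(path, encoding="utf-8") as f:
        return char in f.read()
```

The same check applies to the long s (ſ, U+017F) mentioned above.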

Langdata for 3.02

I am looking for langdata for Tel, Kan and Guj. How can I get the langdata for 3.02? The repository currently says its langdata is for 3.04.

CC: @jimregan

khmer - not working with --oem 1

Ref: tesseract-ocr/tesseract#654 (comment)

Attached are box/tiff pairs created using the text2image / tesstrain.sh rendering process.

Fonts used (installed in Windows)

KHMER_FONTS=( \
    "Leelawadee UI Bold" \
    "Leelawadee UI" \
    "Noto Sans Khmer" \
    "Noto Serif Khmer" \
    "Noto Sans Khmer Bold" \
    "Noto Serif Khmer Bold" \
    "Noto Sans Khmer UI Bold" \
    "Noto Sans Khmer UI" \
    )

command used:

training/tesstrain.sh --fonts_dir /mnt/c/Windows/Fonts --lang khm  \
  --linedata_only --noextract_font_properties \
   --langdata_dir ../langdata --tessdata_dir ./tessdata \
  --output_dir ~/tesstutorial/khm 

European 18th century texts

Many European texts from the 18th century use modern typefaces with some special properties. OCR for such texts is currently only partially supported by Tesseract, notably by enm, frm, ita_old and spa_old (see the wiki), which are the only models that include the long s.

Support is missing for Latin texts (very widely used at that time) and German texts, and maybe others, too.

Bigrams file not in sync with training text

@theraysmith

Ray, I am using a modified version of Sanskrit training text with Vedic accents, rupee sign etc.

The font properties generation in the tesstrain.sh processing uses the ngrams file, which seems to be based on the bigrams file.

Is the bigrams file generated from the training text? If so, is there a utility to create bigrams from a training text?

Will using a new training text that is not in sync with the bigrams file cause errors during training?
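In case no such utility is published, character bigram counts can be derived from a training text directly; this is a sketch under that assumption, not necessarily what the official pipeline does:

```python
# Count adjacent character pairs (bigrams) within each whitespace-
# separated word of a training text.
from collections import Counter

def bigram_counts(text):
    counts = Counter()
    for word in text.split():
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts
```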

training small fonts- Chinese

Hi, where can I go for information on how to train small Chinese fonts? Do I just make my PDF use much smaller fonts? My target is the 100% Chinese font in Chrome as seen in Google News.

Tibetan Unicharset

The Tibetan unicharset is in a different format than the unicharsets for other scripts. It was uploaded as part of the 3.04 langdata and seems to be based on syllables rather than characters (similar to how unicharsets generated during training are based on the training_text).

http://unicode.org/charts/PDF/U0F00.pdf shows all the Tibetan characters in Unicode, but the Tibetan unicharset https://github.com/tesseract-ocr/langdata/blob/master/Tibetan.unicharset is based on combinations of characters.

For other scripts, e.g. Devanagari, the unicharset and the Unicode chart follow a similar format. That data was uploaded during the initial creation of the repo and reflects 3.02.
http://unicode.org/charts/PDF/U0900.pdf
https://github.com/tesseract-ocr/langdata/blob/master/Devanagari.unicharset

@theraysmith Which is the correct format for unicharsets going forward, for 4.0.0-alpha?

If it is limited to the character combinations as in the Tibetan 3.04 unicharset, that could cause a problem if a new training text is used which has some new syllable combinations.
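To illustrate the difference, a single written Tibetan syllable decomposes into several Unicode codepoints from the U+0F00 chart; the syllable below (བསྒྲུབས) is an illustrative example:

```python
# One syllable entry in a syllable-based unicharset corresponds to a
# whole sequence of chart-level characters from the U+0F00 block.
syllable = "\u0f56\u0f66\u0f92\u0fb2\u0f74\u0f56\u0f66"  # བསྒྲུབས
codepoints = [f"U+{ord(ch):04X}" for ch in syllable]
print(codepoints)
```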

About Uyghur(Uighur) langdata

Hi,
I am a native Uyghur speaker.
I found some erroneous characters (not Uyghur characters) in uig.training_text. I fixed the errors; please update.

I made a uig.frequent_words_list file, with the words sorted by frequency.
The all_character_forms.txt file contains all Uyghur characters and all their forms. It can be added to the uig.training_text file.

uig.zip
all_character_forms.txt

LSTM: character set vs Script unicharset vs Training text unicharset

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 refers to

The lstmtraining program is a multi-purpose tool for training neural networks. The following table describes its command-line options:

Flag	Type	Default	Explanation
U	string	none	Path to the unicharset for the character set.

and also

Fine tuning is the process of training an existing model on new data without changing anything else like the character set or any part of the network. Doesn't need a unicharset, script_dir, or net_spec, as they all come from the existing model.

and

Fine tuning is OK if you don't want to change the character set, but what if you want to train for Klingon? You are unlikely to have much training data and it is unlike anything else, so what do you do? You can try removing some of the top layers of an existing network model, replace some of them with new randomized layers, and train with your data. The command-line is mostly the same as Training from scratch, as you have to supply a unicharset and net_spec, and you also have to provide a model to --continue_from and --append_index.

@theraysmith Ray, please clarify: what is the character set referred to here? Thanks!

langdata/pol/pol.wordlist duplicated entries

Hello!

In https://raw.githubusercontent.com/tesseract-ocr/langdata/master/pol/pol.wordlist (05ec588 on 25 Jun 2015) there is a great list of Polish words.

Somehow, though, even though I am Polish, using a Polish keyboard and Polish Windows 8.1 with Polish fonts, I see "st" (shown as "?" in notepad2) in many lines. This is not a character you encounter at all in the Polish language.

After a few seconds, I realised that every line with the mysterious "st" sign follows a line containing the plain letter pair "st", for which it substitutes.

There is no "st" ligature in Polish (there are no single-character digraphs in Polish at all; they are always written with two separate letters: "rz" is just "r" and "z", "ch" is just "c" and "h", and the same goes for "sz", "cz", "dz", "dż" and "dź").

So, basically, the list has a duplicated entry for every word containing "st".
There are currently 658822 full lines + 1 newline in the raw file; after a quick regexp in notepad2 to remove the duplicates, I ended up with 608933 full lines + 1 newline, an 8% reduction in line count.

Now, if there is a legitimate reason for the duplicates with a non-existent character (maybe such redundancy makes OCR easier? I don't know the topic well enough to guess), then great, this issue is moot and invalid. But if there is no such reason, then the Polish wordlist can be pruned automatically.
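Assuming the mysterious character is a Latin ligature such as ﬆ (U+FB06), which NFKC-normalizes to the plain letters, the pruning could be automated along these lines (a sketch under that assumption):

```python
# Drop wordlist entries that are ligature variants of an entry that
# already exists in plain-letter form. NFKC maps compatibility
# ligatures such as U+FB06 (st) to their letter sequences.
import unicodedata

def prune_ligature_duplicates(lines):
    plain = set(lines)
    kept = []
    for line in lines:
        norm = unicodedata.normalize("NFKC", line)
        if norm != line and norm in plain:
            continue  # ligature variant of an entry we already have
        kept.append(line)
    return kept
```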

Use a different set of fonts for Persian

(moved from tesseract-ocr/tesseract#294)

First of all, thanks for finally adding support to Tesseract. From quickly inspecting the Persian-related code in Tesseract, I reached https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L520, which I would say, speculatively, is not a good set of fonts for training on Persian printed text and can result in poor OCR quality, as most Persian fonts do not have the style these fonts have. In "Font recognition using Variogram fractal dimension", a good set of Persian fonts is introduced (second page, at the bottom), which, as you can see there too, is different from the favoured Arabic-language fonts (even though both languages use the Arabic script). So for training Persian OCR for Tesseract, I suggest adding to or replacing the current fonts with these free fonts: Nazli (i.e. Nazanin, as named in that article) and Titr from the Debian fonts-farsiweb package, and also XB Zar and XB Yaghut from the OFL-licensed xfonts. Thank you.

Add Devanagari-extended and Vedic extensions to Devanagari.unicharset

See tesseract-ocr/tesseract#545 for details

At a minimum, add support for U+0951, U+0952, U+A8F3 and U+1CDA in the Devanagari unicharset; otherwise LSTM training/evaluation fails with messages such as:

Can't encode transcription: शान्तिः॒ शान्तिः॑ ॥ ॐ पूर्ण॒मदः॒ पूर्ण॒मिदं॒ पूर्णा॒त्पूर्ण॒मुद॒च्यते । पूर्ण॒स्य
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff92 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa5 ffffff81 … [long byte dump truncated]
Can't encode transcription: त॒नुवं॑ पि॒प्रय॑स्वा॒स्मभ्यं॑ च॒ सौभ॑ग॒माय॑जस्व ॥ ६॥
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff91 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff8d … [long byte dump truncated]
Can't encode transcription: नाक॑स्य पृ॒ष्ठम॒भि सं॒वसा॑नो॒ वैष्ण॑वीं लो॒क इ॒ह मा॑दयन्ताम् ॥ ७॥
At iteration 0, stage 0, Eval Char error rate=3.5082962, Word error rate=13.780345
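A quick way to confirm which of the codepoints named above are absent from a unicharset (a sketch; it assumes the unichar itself is the first whitespace-separated field of each line, as in the samples elsewhere in this tracker):

```python
# Report which of the needed Vedic/extended codepoints are missing
# from a unicharset, given its lines as strings.
NEEDED = {"\u0951", "\u0952", "\ua8f3", "\u1cda"}

def missing_from_unicharset(lines):
    present = {line.split()[0] for line in lines if line.strip()}
    return {f"U+{ord(ch):04X}" for ch in NEEDED if ch not in present}
```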

Superscripts & subscripts

Copied from #59:


@Shreeshrii commented

Just checking whether this new training will also address:

  1. Correct handling of superscripts

@theraysmith commented

  1. Correct handling of superscripts

Beyond the scope of this change.
Sub/superscripts are much harder to deal with, as they have to be trained, and that means incorporating them correctly into the training path, and working out how to pass the information back out of the line recognizer to the output. At the moment it seems the iterator supports discovery of sub/super, but there is no output renderer that handles it. (Not even hOCR?)

Question:
For which languages/scripts is it desirable to support sub/super?


Shreeshrii commented

Regarding superscripts/subscripts etc, I can point out three cases based on the languages I know.

a. English - books, theses etc. have a number of footnotes referred to in the text with superscripts. I guess this applies to all languages written in the Latin script. Usually these appear at the end of words.

b. Tamil - Sanskrit texts transliterated into the Tamil script use superscript/subscript 2, 3, 4 (sometimes 1 as well) to distinguish between different sounds (to support the Sanskrit alphabet, which does not map directly onto the Tamil script). These can actually occur in the middle of Tamil words.

c. Hindi, Sanskrit and other Indian languages - Hindi books, theses etc. use superscripts for referring to footnotes (similar to English above). The difference is that in some cases these use the Latin digits 0-9 and in some cases Devanagari digits (for Hindi, Sanskrit etc.). Unicode has superscript digits 0-9 for the Latin script but not for the Devanagari script. I would suggest supporting the Latin-script superscript numbers.

Scanned pages with Devanagari superscripts should also be mapped to the Latin-script superscript numbers. Similarly for other Indian languages.


@stweil commented

English - books, theses etc. have a number of footnotes referred to in the text with superscripts. I guess this applies to all languages written in the Latin script. Usually these appear at the end of words.

At least it applies to German. There are also superscripts after punctuation characters at the end of sentences.

Should all superscripts be handled in the same way, or do we need a different handling for those superscripts which have a special UTF-8 code like ¹, ² or ³?


Shreeshrii commented

See page 3 in http://sanskritdocuments.org/doc_ganesha/gaNanAyak8-ta.pdf for superscript usage in Tamil.

Unicode has subscripted and superscripted versions of a number of characters including a full set of Arabic numerals.

The most common superscript digits (1, 2, and 3) were in ISO-8859-1 and were therefore carried over into those positions in the Latin-1 range of Unicode. The rest were placed in a dedicated section of Unicode at U+2070 to U+209F.


Shreeshrii commented

Should all superscripts be handled in the same way, or do we need a different handling for those superscripts which have a special UTF-8 code like ¹, ² or ³?

All superscripts have a special UTF-8 code, though in different ranges. Not all fonts support all of the superscripts and subscripts.
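The codepoint split described above can be made concrete: 1-3 live in the Latin-1 range (carried over from ISO-8859-1) while the rest are in the U+2070 block. A sketch:

```python
# Map ASCII digits to their Unicode superscript codepoints.
SUPERSCRIPT = {
    "0": "\u2070", "1": "\u00b9", "2": "\u00b2", "3": "\u00b3",
    "4": "\u2074", "5": "\u2075", "6": "\u2076", "7": "\u2077",
    "8": "\u2078", "9": "\u2079",
}

def superscript(number):
    return "".join(SUPERSCRIPT[d] for d in str(number))
```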

Arabic and Hebrew unicharsets seem to be corrupted

The Arabic unicharset has character descriptions in two formats and seems to be corrupted. See the following for an example (note the stray trailing letters on some lines):

؀ 0 92,92,148,148,152,152,5,5,161,161 Arabic 1 5 1 ؀	# ؀ [600 ]
؁ 0 67,67,135,135,306,306,1,1,312,312 Arabic 2 5 2 ؁	# ؁ [601 ]
؂ 0 93,93,141,141,152,152,5,5,163,163 Arabic 3 5 3 ؂	# ؂ [602 ]
؃ 0 93,93,155,155,171,171,5,5,182,182 Arabic 4 5 4 ؃	# ؃ [603 ]
؋ 0 36,36,195,195,54,54,5,5,64,64 Arabic 5 13 5 ؋	# ؋ [60b ]
؍ 10 106,106,148,148,47,47,7,7,61,61 Arabic 6 13 6 ؍	# ؍ [60d ]p
؎ 0 111,111,156,156,138,138,5,5,150,150 Arabic 7 10 7 ؎	# ؎ [60e ]
؏ 0 10,10,174,174,99,99,1,1,113,113 Arabic 8 10 8 ؏	# ؏ [60f ]
ؐ 0 213,213,251,251,47,47,35,35,0,0 Arabic 9 17 9 ؐ	# ؐ [610 ]
ؑ 0 209,209,244,244,47,47,31,31,0,0 Arabic 10 17 10 ؑ	# ؑ [611 ]
ؒ 0 213,213,255,255,68,68,26,26,0,0 Arabic 11 17 11 ؒ	# ؒ [612 ]
ؓ 0 213,213,255,255,66,66,26,26,0,0 Arabic 12 17 12 ؓ	# ؓ [613 ]
ؔ 0 227,240,255,255,87,98,0,1,0,0 Arabic 13 17 13 ؔ	# ؔ [614 ]
ؕ 0 228,248,255,255,50,78,8,49,0,179 Arabic 14 17 14 ؕ	# ؕ [615 ]
؞ 10 96,99,146,148,49,51,5,6,59,64 Arabic 15 13 15 ؞	# ؞ [61e ]p
ء 1 32,112,129,220,43,163,3,67,62,199 Arabic 16 13 16 ء	# ء [621 ]x
آ 1 26,117,230,255,36,161,0,58,33,198 Arabic 17 13 17 آ	# آ [622 ]x
أ 1 26,117,248,255,29,148,0,67,33,193 Arabic 18 13 18 أ	# أ [623 ]x
ؤ 1 0,68,190,255,70,290,0,27,62,266 Arabic 19 13 19 ؤ	# ؤ [624 ]x
إ 1 0,59,200,255,29,181,0,67,33,222 Arabic 20 13 20 إ	# إ [625 ]x
ئ 1 0,100,185,255,95,431,0,45,103,467 Arabic 21 13 21 ئ	# ئ [626 ]x
ا 1 26,117,200,255,11,181,7,82,33,222 Arabic 22 13 22 ا	# ا [627 ]x
ب 1 0,71,140,224,113,339,0,50,123,378 Arabic 23 13 23 ب	# ب [628 ]x
ة 1 55,123,190,255,40,181,0,60,48,222 Arabic 24 13 24 ة	# ة [629 ]x
ت 1 58,123,170,255,113,339,2,50,123,378 Arabic 25 13 25 ت	# ت [62a ]x
ث 1 58,121,192,255,113,339,2,50,123,378 Arabic 26 13 26 ث	# ث [62b ]x
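One way to flag the inconsistent lines mechanically (a sketch based on the samples above, where a well-formed line ends with the "[hex ]" comment and the corrupted ones carry stray trailing letters such as "p" or "x"):

```python
# Return the 1-based indices of unicharset lines that have trailing
# characters after the closing "]" of the hex comment.
import re

def suspicious_lines(lines):
    return [i for i, line in enumerate(lines, 1)
            if re.search(r"\]\s*\S+$", line.rstrip())]
```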

Suggest 'deva' for Devanagari

With LSTM training the dictionary dawg files have become optional. In light of this, I want to suggest an additional traineddata file for Devanagari script, which can cater to all main languages written in it.

The reason for this suggestion: when I tested OCR on a Marathi text, a lot of words with rakaara were not recognised correctly. The same page OCRed with Sanskrit recognised those correctly, but got some other words wrong.

So, in addition to the multiple traineddata files for the various languages written in Devanagari, a script-level traineddata file would be useful.

cherokee resources

In response to tesseract-ocr/tesseract#654 (comment),

@theraysmith

An Crúbadán, edited by Kevin Scannell, is licensed under a Creative Commons Attribution 4.0 International License.

The zip files linked from the above pages contain word lists, as well as the lists of URLs scraped from the vast quantities of text freely available on the web, used for building corpora for languages with small numbers of speakers and/or limited computational resources.

chr - Cherokee - http://crubadan.org/languages/chr

Cherokee Unicode Fonts

http://www.cherokee.org/AboutTheNation/Language/CherokeeFont.aspx
http://www.languagegeek.com/font/fontdownload.html

German Fraktur

From tesseract-ocr/tesseract#40

@stweil commented

Are there also new data files planned for old German (deu_frak)? I was
surprised that the default English model with LSTM could recognize some
words.

@theraysmith commented

I don't think I generated the original deu_frak. I have the fonts to do it with LSTM, but I don't know if I have a decent amount of corpus data to hand. With English at least, the language was different in the days of Fraktur (Ye Olde Shoppe). I know German continued to be written in Fraktur until the 1940s, so that might be easier. Or is there an old German that is analogous to Ye Olde Shoppe for English?

stweil commented

Fraktur was used for an important German newspaper (Reichsanzeiger) until 1945. I'd like to try some pages from that newspaper with Tesseract LSTM. Surprisingly even with the English data Tesseract was able to recognize at least some words written in Fraktur.

There is an Old High German (similar to Old English), but the German translation of the New Testament by Martin Luther (1521) was one of the first major printed books in German, and it basically started the modern German language (High German), which is still used today.

@jbaiter commented

I have a decent amount of corpus data for Fraktur from scanned books at hand, about 500k lines in hOCR files (~50GB with TIF images). I've yet to publish it, but if you have somewhere where I could send/upload it, I'd be glad to.

theraysmith commented

The md file documents the training process in tutorial detail, but line boxes and transcriptions sound perfect!

300k lines should make it work really well. I would be happy to take it and help you, but we would have to get into licenses, copyright and all that first.
For now it might be best to hang on for the instructions.

jbaiter commented

The text is CC0 and the images are CC-BY-NC, so that shouldn't be an issue :-) They're going to be public anyway once I've prepped the dataset for publication.

Related:
tesseract-ocr/tessdata#49

Vietnamese

Forwarding below some feedback re Vietnamese traineddata for 4.00.00

Vietnamese lang data for tess 4.00 seems to have better accuracy, but still sometimes mixes up the acute and hook-above marks when they appear on top of the circumflex (stacked diacritics).

Æ missing from the Norwegian language training data

The Norwegian character Æ is missing from the language files for Norwegian. Performing OCR where the Æ is present results in either /E or AE.

Some samples:
Ærfuglveien 44 er adressen jeg bor på.
Min adresse er Ærfuglgaten 73.
Ærlighet varer lengst.
Ærfuglen er den største andearten i vårt land.
Ærekrenkelse er en handling som består i å krenke en annens æresfølelse, eller opptre på en måte som er egnet til å skade en annens gode navn og rykte eller til å utsette ham for hat, ringeakt eller tap av den for hans stilling eller næring fornødne tillit.
Æsene lå i kamp med en annen gudeslekt, vanene.
Ærgjerrighet har vært viktig for mange av oss og da vi var småjenter, skjønte vi at det er viktig å arbeide hardt og bli til noe.
Det var Æsene som var snille.
