Hi, I want to develop an OCR for Balinese (…


Balinese Script OCR,about tesseract-ocr/langdata

Comments (26)

stweil commented on May 23, 2024

Hi @gindrawan, with jTessBoxEditor you will get a recognition model which uses the old legacy recognizer, but not the LSTM one.

For training LSTM, you need a large amount of ground truth data, meaning pairs of line images and text files with the corresponding text. You can use generated images, made by rendering the text with a Balinese font, and you can also use scans from Balinese publications (books, newspapers, ...), where you have to extract the lines and transcribe the text. Ideally both kinds of images are available.

from langdata.

Shreeshrii commented on May 23, 2024

Are there any converters from Bali Simbar Dwijendra to Unicode?


gindrawan commented on May 23, 2024

> Are there any converters from Bali Simbar Dwijendra to Unicode?

As far as I know, there is no such converter. I found the Vimala font, whose glyph shapes are quite close to the Bali Simbar Dwijendra font, as I mentioned in #126.


gindrawan commented on May 23, 2024

Hi @Shreeshrii ,

Based on your changes to the tesseract code base in

tesseract-ocr/tesseract@b34cf9d#diff-eaafd22a79065f5b8d28318d482e650d

if I insert, let's say, "Noto Sans Balinese" or "Vimala" at line 608 (after "Prada"), what else do I critically need to add or change to get the training working? (Of course, I am also looking at the other 2 files that you changed.)

tesseract-ocr/tesseract@0eb7be1#diff-eaafd22a79065f5b8d28318d482e650d
tesseract-ocr/tesseract@7957288#diff-eaafd22a79065f5b8d28318d482e650d

I'm still in "trial and error" mode, learning from your https://github.com/Shreeshrii/tessdata_jav_java. Before I go down that route, some guidance from you would be useful. Thanks...


Shreeshrii commented on May 23, 2024

jav_java was done more than a year ago. Now there is also the possibility of training from line images. See the tesseract-ocr/tesstrain repo. Please wait for a day or two. I am in the process of setting up something for Balinese that you can then extend with your training text. It is possible that no changes will be required in the tesseract codebase. It will also be useful if you can create ground truth transcriptions in Unicode for at least 5 scanned page images from books, which can be used for validating the training. You can also create a few hundred line images with transcriptions for fine-tuning the traineddata created with synthetic images.


gindrawan commented on May 23, 2024

Thank you @Shreeshrii
Here are scanned page images from books (found by a quick Internet search), in various image types and sizes.
I am still preparing the synthetic images (in Noto Sans/Serif Balinese and Vimala) and hope to post them today or tomorrow.

Another thing: if the trained data is generated successfully, will it be compatible with Tesseract4Android (https://github.com/adaptech-cz/Tesseract4Android)? They require trained data files from https://github.com/tesseract-ocr/tessdata/tree/4.0.0

balinese-script-images-v1.zip


Shreeshrii commented on May 23, 2024

Just images are not enough. What is needed is the correct (ground truth) text in Unicode for each of those images. So the files should be 001.png and 001.gt.txt: the same basename, with .gt.txt holding the Unicode text for each image. For a work in progress, see https://github.com/Shreeshrii/tesstrain-bali/tree/master/test. I need the correct text for the images so that it can be compared with the OCRed text to verify accuracy on actual images.
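
The naming rule above can be checked mechanically before sending files. The following is only a minimal sketch: the function name `find_unpaired` and the set of accepted image suffixes are assumptions made for illustration, not something tesstrain defines.

```python
from pathlib import Path

# Assumed set of image types; adjust to whatever the scans actually use.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff"}
GT_SUFFIX = ".gt.txt"


def find_unpaired(folder):
    """Return (basenames of images lacking a .gt.txt, basenames of .gt.txt lacking an image)."""
    entries = list(Path(folder).iterdir())
    images = {p.stem for p in entries if p.suffix.lower() in IMAGE_SUFFIXES}
    # Path.stem would give "001.gt" for "001.gt.txt", so strip the full suffix by hand.
    texts = {p.name[: -len(GT_SUFFIX)] for p in entries if p.name.endswith(GT_SUFFIX)}
    return sorted(images - texts), sorted(texts - images)
```

Calling `find_unpaired("test")` on a folder returns two lists; both should be empty when every image has its transcription and vice versa.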


gindrawan commented on May 23, 2024

Sorry, I forgot about the text files. They may need more time.
OK then, I am still preparing the synthetic images; I think those will be ready faster.
One question: how many words are needed per line?


gindrawan commented on May 23, 2024

This is a small set of image and text file pairs using Noto Serif Balinese; I took the text from https://en.wikipedia.org/wiki/Balinese_script. I hope it can be used for now.
small-pair-image-text.zip


gindrawan commented on May 23, 2024

Oh, I forgot. Do the images need box files, or only the Unicode text?


Shreeshrii commented on May 23, 2024

Just the unicode text.


gindrawan commented on May 23, 2024

Hi @Shreeshrii,

It seems I need more time to prepare the training data (1-2 more days).

Meanwhile, I just realized that training data can come as page images (https://github.com/topherseance/javanese-aksara-training-text) or as line images.

Based on your previous answer, it seems you prefer line images? What about page images?

Preparing line images takes more effort in my case, because each page image has to be split into several line images. But if the training result is noticeably better, that's OK.

Attached is a sample page image with its ground truth text. Is it OK before I proceed to line images?

ban.notoserifbalinese.gt_001.zip


Shreeshrii commented on May 23, 2024

Are you preparing synthetic data using fonts or using actual images similar to what needs to be recognised later?


Shreeshrii commented on May 23, 2024

https://github.com/Shreeshrii/tesstrain-bali/tree/master/langdata

I had done a training run with 4-5 fonts.


gindrawan commented on May 23, 2024

> Are you preparing synthetic data using fonts or using actual images similar to what needs to be recognised later?

I am preparing about 5 thousand words (the remaining 29 thousand or so are still having their Unicode verified) for synthetic data using Noto Serif Balinese. I just downloaded the latest font, updated 3 days ago (https://github.com/googlefonts/noto-fonts/tree/master/phaseIII_only/unhinted/ttf/NotoSerifBalinese). It is somehow more up to date than Noto Sans Balinese.

Those 5 thousand words have already been turned into 101 page images, each containing 12 lines of training text with about 5-10 words per line. I need a little more time to finalize them. Going to line images would need extra time.

After that I will move on to Vimala, using the same Unicode text as for Noto Serif Balinese. Vimala is more likely to be needed for recognizing actual images.

The font most needed for recognizing actual images is Bali Simbar Dwijendra (BSD). We plan to handle it later, since it uses non-Balinese Unicode and the training data takes more time and effort to prepare. If BSD is included, the Balinese script recognition app would have 2 post-processing options, Unicode and non-Unicode (I imagine a radio button to select one before recognition).


Shreeshrii commented on May 23, 2024

Generation of synthetic data is not an issue. It is actually quite easy to generate page images or line images given a training text and set of fonts.

See https://github.com/Shreeshrii/tesstrain-bali/tree/master/gt/bali-Vimala
which has line images and their ground truth generated from random Sanskrit text (https://github.com/Shreeshrii/tesstrain-bali/blob/master/langdata/bali.training_text) converted to Balinese script. This is not showing up correctly in my web browser, but it is OK when I apply the Vimala font in Notepad++.

LSTM training works on line images, so it is better to do line images. But this can be done easily by a computer.
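
For page-level transcriptions, the text side of that conversion can be sketched as below. This is an illustration only: the function name `split_page_ground_truth` and the `<page>_<NNN>.gt.txt` naming are assumptions, and the matching line images (cropped from the page or rendered from the same lines) still have to be produced separately with the same basenames.

```python
from pathlib import Path


def split_page_ground_truth(page_txt, out_dir):
    """Write one <page>_<NNN>.gt.txt file per non-empty line of a page transcription."""
    page_txt = Path(page_txt)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Derive the page basename, accepting either "page.gt.txt" or "page.txt".
    base = page_txt.name
    for suffix in (".gt.txt", ".txt"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]
            break

    lines = [ln.strip() for ln in page_txt.read_text(encoding="utf-8").splitlines()]
    written = []
    for number, line in enumerate((ln for ln in lines if ln), start=1):
        target = out_dir / f"{base}_{number:03d}.gt.txt"
        target.write_text(line + "\n", encoding="utf-8")
        written.append(target)
    return written
```

A page with 12 lines of text would yield 12 small `.gt.txt` files, one per line image.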

It seems to me that you are just taking a word list and generating text lines and images from that. Instead you should actually be using sentences and paragraphs and phrases along with punctuation similar to the pages that need to be recognized.

> The most needed for actual images recognition, Bali Simbar Dwijendra (BSD) we plan later since using non-balinese unicode,

If there were any script that maps from BSD to Unicode, then it could probably be handled programmatically. Otherwise you should take a page in BSD and transcribe it in Unicode.

When I asked for page images for testing, I meant some sample actual images (in BSD) .

I am generating images in five fonts:
Kadiri
Noto Sans Balinese
Noto Serif Balinese
Pustaka Bali
Vimala

However, if only Vimala is required, it will probably be faster to get convergence.


gindrawan commented on May 23, 2024

I think it's OK to include all of those fonts. Kadiri, Pustaka, and Vimala each seem to mimic a different style of ancient glyphs. Moreover, Vimala was developed with the BSD style as a reference. Noto Sans Balinese and Noto Serif Balinese do not seem to differ much from each other. I don't know why Google released both of them.


gindrawan commented on May 23, 2024

> If there was any script which maps from BSD to Unicode then it can probably be handled programmatically. Otherwise you should take page in BSD and transcribe it in Unicode.

@Shreeshrii, I have just made a mapping from BSD to Balinese Unicode; perhaps it is useful.
bsdcode.2.balineseunicode.txt


Shreeshrii commented on May 23, 2024

Is http://www.unicode.org/udhr/d/udhr_ban.html in BSD?

I did a simple substitution using sed to convert the text from there to Unicode using the mapping you suggested. I don't think it is correct. B is not converted, also some signs don't seem right. I don't know the language to verify.

bsd2unicode.sed.txt

udhr.unicode.txt
udhr.latn.txt


gindrawan commented on May 23, 2024

> Is http://www.unicode.org/udhr/d/udhr_ban.html in BSD?

It is in Balinese Latin (compare Javanese Latin, which uses the convention name "java", and Javanese script, which uses "java-jav"). From there we can convert it to many Balinese script fonts (BSD, Vimala, Noto Serif Balinese, etc.), but it needs some rule-based text preprocessing first.
For example:
The first word "Sami" on the second line must be converted as follows:

BSD (non-Unicode): "smi" (see the LibreOffice ODT illustration file in the attachment)
Other Balinese fonts (Unicode): "ᬲᬫᬶ"
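
The "smi" example can be turned into a small table-driven converter, which also makes it easy to see which BSD codes the table does not cover yet. This is a toy sketch under loud assumptions: the three-entry table is derived only from the single example above by aligning it character by character, it is not the real BSD mapping, and a full converter may also need reordering rules that are not shown here. The names `BSD_TO_UNICODE` and `bsd_to_unicode` are made up for illustration.

```python
# Toy table taken from the single "smi" example; NOT the real BSD mapping.
BSD_TO_UNICODE = {"s": "ᬲ", "m": "ᬫ", "i": "ᬶ"}


def bsd_to_unicode(text, table=BSD_TO_UNICODE):
    """Convert BSD-coded text with longest-match-first substitution.

    Returns (converted_text, unmapped_characters) so that gaps in the
    table show up explicitly instead of passing through unnoticed.
    """
    # Try longer keys first so multi-character codes win over their prefixes.
    keys = sorted((k for k in table if k), key=len, reverse=True)
    out, unmapped, pos = [], [], 0
    while pos < len(text):
        for key in keys:
            if text.startswith(key, pos):
                out.append(table[key])
                pos += len(key)
                break
        else:
            ch = text[pos]
            out.append(ch)  # keep unknown input as-is
            if not ch.isspace() and ch not in unmapped:
                unmapped.append(ch)
            pos += 1
    return "".join(out), unmapped
```

With this toy table, `bsd_to_unicode("smi")` gives the Unicode form of the example word and an empty list, while any code missing from the table is returned in the second value.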

For the reverse process (Balinese script to Balinese Latin), I actually don't know how to make this work in Tesseract, as illustrated in the attachment.

sami.zip


gindrawan commented on May 23, 2024

Oh, for Balinese script to Balinese Latin in the illustration file, "the input" means "the image input".


gindrawan commented on May 23, 2024

This is my LibreOffice screenshot. You must install the Bali Simbar Dwijendra font on your Linux OS.


Shreeshrii commented on May 23, 2024

The way tesseract (lstm version) works, the image will be recognised as Unicode text which will render correctly with Unicode Balinese fonts. So, both Vimala and Noto fonts should be able to render the same output.


gindrawan commented on May 23, 2024

> I did a simple substitution using sed to convert the text from there to Unicode using the mapping you suggested. I don't think it is correct. B is not converted, also some signs don't seem right. I don't know the language to verify.

Hi @Shreeshrii, I have just improved the BSD code to Unicode mapping (https://github.com/gindrawan/balinese-bsdcode-2-unicode) based on your sed file. In the attachment I marked entries with the status OK, REV, or ADDED. Not all of the added mappings are in the attached file; see the link.

I have tested bali1.traineddata from https://github.com/Shreeshrii/tesstrain-bali/tree/master/data on a simple BSD word image, but the result is still not right (the file is in the attachment, with gt text files in both BSD code and Unicode for checking). Perhaps that is because the model has not yet been trained on BSD.
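
To put a number on "not right", the OCR output can be compared with the ground truth text by character error rate. The following is a minimal sketch of that comparison, not the evaluation tesseract or tesstrain performs; the function names are illustrative, and it counts Unicode code points, so a Balinese syllable made of several code points can contribute more than one error.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth, ocr_text):
    """Edit distance divided by the ground truth length."""
    if not ground_truth:
        return 0.0 if not ocr_text else 1.0
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)
```

A value of 0.0 means the OCR text matches the ground truth exactly; values can exceed 1.0 when the output is much longer than the ground truth.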

Regarding udhr.latn.txt: if you want to transliterate it to BSD-style Balinese script, you can try this Android app (still not perfect, though): https://play.google.com/store/apps/details?id=id.ac.undiksha.aksarabalisd&hl=en

bsd2unicode.sed.txt
bakta.zip


Shreeshrii commented on May 23, 2024

You have given links to apps that convert from Latn to BSD as well as to Noto (Unicode) for Balinese. What will be helpful, if you want to train for BSD, is to send me two text files, one in BSD and one in Noto, for the same Balinese text. Similar to the file you sent earlier, but that was just one word.


gindrawan commented on May 23, 2024

I have just made them, but they are still small because generating them is quite manual.
https://github.com/gindrawan/balinese-script-training
I am thinking about how to speed it up...

How will you train tesseract with such data?
I guess you will feed it images generated from the BSD gt text files and map them to the NSB gt text files.
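
If that guess is right, the two parallel files first have to be aligned line by line: the BSD line is what gets rendered with the BSD font, and the Unicode line is what goes into the `.gt.txt` file. The sketch below covers only that alignment and is an assumption about the workflow, not a confirmed procedure; the function name `pair_parallel_lines` is hypothetical, and rendering the images is not shown.

```python
from pathlib import Path


def pair_parallel_lines(bsd_file, unicode_file):
    """Return (bsd_line, unicode_line) pairs from two line-aligned text files."""
    bsd_lines = Path(bsd_file).read_text(encoding="utf-8").splitlines()
    uni_lines = Path(unicode_file).read_text(encoding="utf-8").splitlines()
    if len(bsd_lines) != len(uni_lines):
        raise ValueError(
            f"line count differs: {len(bsd_lines)} BSD vs {len(uni_lines)} Unicode"
        )
    pairs = []
    for number, (bsd, uni) in enumerate(zip(bsd_lines, uni_lines), start=1):
        bsd, uni = bsd.strip(), uni.strip()
        if not bsd and not uni:
            continue  # blank on both sides: skip
        if not bsd or not uni:
            # A blank on one side only usually means the files drifted apart.
            raise ValueError(f"line {number} is empty in only one file")
        pairs.append((bsd, uni))
    return pairs
```

Each returned pair would then give one training sample: an image rendered from the first element and a ground truth file containing the second.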
