tesseract-ocr / tesstrain Goto Github PK

View Code? Open in Web Editor NEW

569.0 569.0 172.0 13.52 MB

Train Tesseract LSTM with make

License: Apache License 2.0

Makefile 16.37% Python 83.63%

ocr tesseract training

tesstrain's People

Contributors

Stargazers

Watchers

Forkers

kamilc kba rabi3elbeji dangvinh1406 lorenzob discharge97 atharmarajah kvkevin thongvm 0xsalim oyvindlgjesdal riyanadsilva mb1024code deimsdeutsch huangshaoze studuino braimourad arlaf idolized22 acedive s0bar nickmhnk robertsonwang bkuasney spayefin stweil bolyachevets wrznr ireoo rishi-arch valrcs tykniess cvresearch-fun mikylucky joyce-cl spc16670 jertlok jkamlah supunspwithana tdf1995 craig-matadeen xhuvom sanjanasanikommu younesh11 shalevy1 lannt-xyz kurhula ankan-das ulb-sachsen-anhalt ted430 ucodai abhishekthanki arnabdas william-chin feitianno3 chladams beczzzhao aoso3 iamprashant gerhobbelt shunshuiyuanxin celestialized adrianacamilleri ortega-dan venkatapathy onmo-games supriyoa kforti dakotagporter vishaldhimanai global-localhost global19 global19-atlassian-net eyagarci nagadomi isabella232 bertsky wario2k tfukumori akash-akya zaheer-at-techlo9 sgly2002 copninixh chegejames arifemre wenxiang-li sc-sc epu-diepcd techthumb1 piaoxiaobo acgnhiki yahiachames z-aliakseyeu jingjie181 lantip jithinanchanattu buliasz toumingl maxpark kolod

tesstrain's Issues

Rename GT directory

data/train is misleading since it contains both training and evaluation data.

newly trained tesseract model not working

We have trained tesseract with custom data having 2000 images for 10k iteration.The size of the trained file (digitsmodel.traineddata) is very less (5.1 KB).
When we are testing the newly trained model,we are getting the following error

raise TesseractError(status_code, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, "Error: LSTM requested, but not present!! Loading tesseract. Failed loading language 'digitsmodel' Tesseract couldn't load any languages! Could not initialize tesseract.")

License

Hi !

Please add license info.

That's the first thing I search in any project that looks interesting.

Update README.md

paths
variable names
...

Make Training Errors with Sample Data

OS:
Ubuntu 18.04

What I typed in Terminal:
make training

What I received:
python generate_line_box.py -i "data/ground-truth/andreas_fenitschka_1898_0085_025.tif" -t "data/ground-truth/andreas_fenitschka_1898_0085_025.gt.txt" > "data/ground-truth/andreas_fenitschka_1898_0085_025.box"
Traceback (most recent call last):
File "generate_line_box.py", line 41, in
print(u"%s %d %d %d %d 0" % (prev_char, 0, 0, width, height))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017f' in position 0: ordinal not in range(128)
Makefile:111: recipe for target 'data/ground-truth/andreas_fenitschka_1898_0085_025.box' failed
make: *** [data/ground-truth/andreas_fenitschka_1898_0085_025.box] Error 1

Add support for finetuning

I use a modified version of makefile for finetuning. Please add the relevant functionality from file below to the current makefile or add new versions for finetuning and replace a layer training.

export

SHELL := /bin/bash
LOCAL := $(PWD)/usr
PATH := $(LOCAL)/bin:$(PATH)
HOME := /home/ubuntu
TESSDATA =  $(HOME)/tessdata_best
LANGDATA = $(HOME)/langdata

# Name of the model to be built
MODEL_NAME = frk

# Name of the model to continue from
CONTINUE_FROM = frk

# Normalization Mode - see src/training/language_specific.sh for details 
NORM_MODE = 2

# Tesseract model repo to use. Default: $(TESSDATA_REPO)
TESSDATA_REPO = _best

# Train directory
TRAIN := data/train

# BEGIN-EVAL makefile-parser --make-help Makefile

help:
	@echo ""
	@echo "  Targets"
	@echo ""
	@echo "    unicharset       Create unicharset"
	@echo "    lists            Create lists of lstmf filenames for training and eval"
	@echo "    training         Start training"
	@echo "    proto-model      Build the proto model"
	@echo "    leptonica        Build leptonica"
	@echo "    tesseract        Build tesseract"
	@echo "    tesseract-langs  Download tesseract-langs"
	@echo "    langdata         Download langdata"
	@echo "    clean            Clean all generated files"
	@echo ""
	@echo "  Variables"
	@echo ""
	@echo "    MODEL_NAME         Name of the model to be built"
	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
	@echo "    TRAIN              Train directory"
	@echo "    RATIO_TRAIN        Ratio of train / eval training data"

# END-EVAL

# Ratio of train / eval training data
RATIO_TRAIN := 0.90

ALL_BOXES = data/all-boxes
ALL_LSTMF = data/all-lstmf

# Create unicharset
unicharset: data/unicharset

# Create lists of lstmf filenames for training and eval
lists: $(ALL_LSTMF) data/list.train data/list.eval

data/list.train: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l` \
	   no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
	   head -n "$$no" $(ALL_LSTMF) > "$@"

data/list.eval: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l` \
	   no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
	   tail -n "+$$no" $(ALL_LSTMF) > "$@"

# Start training
training: data/$(MODEL_NAME).traineddata

data/unicharset: $(ALL_BOXES)
	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata  $(TESSDATA)/$(CONTINUE_FROM).
	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset  "$@"
	
$(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"
	
$(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%-gt.txt
	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*-gt.txt" > "$@"

$(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"

$(TRAIN)/%.lstmf: $(TRAIN)/%.box
	tesseract $(TRAIN)/$*.tif $(TRAIN)/$*   --psm 6 lstm.train
	

# Build the proto model
proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata

data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
	combine_lang_model \
	  --input_unicharset data/unicharset \
	  --script_dir $(LANGDATA) \
	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
	  --output_dir data/ \
	  --lang $(MODEL_NAME)

data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
	mkdir -p data/checkpoints
	lstmtraining \
	  --continue_from   $(TESSDATA)/$(CONTINUE_FROM).lstm \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output data/checkpoints/$(MODEL_NAME) \
	  --debug_interval -1 \
	  --train_listfile data/list.train \
	  --eval_listfile data/list.eval \
	  --sequential_training \
	  --max_iterations 3000

data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
	lstmtraining \
	--stop_training \
	--continue_from $^ \
	--old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	--traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	--model_output $@

# Clean all generated files
clean:
	find data/train -name '*.box' -delete
	find data/train -name '*.lstmf' -delete
	rm -rf data/all-*
	rm -rf data/list.*
	rm -rf data/$(MODEL_NAME)
	rm -rf data/unicharset
	rm -rf data/checkpoints

Training data from scratch gives error rate alomst 100% and very less ocr accuracy

we are training data from scratch using ocrd- train for devnagari script. We train from following samples images which uses 10 line samples to trained data.

training log after make training
python generate_line_box.py -i "data/ground-truth/marathi1-001.exp0.tif" -t "data/ground-truth/marathi1-001.exp0.gt.txt" > "data/ground-truth/marathi1-001.exp0.box" python generate_line_box.py -i "data/ground-truth/marathi1-002.exp0.tif" -t "data/ground-truth/marathi1-002.exp0.gt.txt" > "data/ground-truth/marathi1-002.exp0.box" python generate_line_box.py -i "data/ground-truth/marathi1-003.exp0.tif" -t "data/ground-truth/marathi1-003.exp0.gt.txt" > "data/ground-truth/marathi1-003.exp0.box" python generate_line_box.py -i "data/ground-truth/marathi1-004.exp0.tif" -t "data/ground-truth/marathi1-004.exp0.gt.txt" > "data/ground-truth/marathi1-004.exp0.box" python generate_line_box.py -i "data/ground-truth/marathi1-005.exp0.tif" -t "data/ground-truth/marathi1-005.exp0.gt.txt" > "data/ground-truth/marathi1-005.exp0.box" python generate_line_box.py -i "data/ground-truth/marathi1-006.exp0.tif" -t "data/ground-truth/marathi1-006.exp0.gt.txt" > "data/ground-truth/marathi1-006.exp0.box" python generate_line_box.py -i "data/ground-truth/marathi1-007.exp0.tif" -t "data/ground-truth/marathi1-007.exp0.gt.txt" > "data/ground-truth/marathi1-007.exp0.box" python generate_line_box.py -i "data/ground-truth/marathi1-008.exp0.tif" -t "data/ground-truth/marathi1-008.exp0.gt.txt" > "data/ground-truth/marathi1-008.exp0.box" python generate_line_box.py -i "data/ground-truth/marathi1-009.exp0.tif" -t "data/ground-truth/marathi1-009.exp0.gt.txt" > "data/ground-truth/marathi1-009.exp0.box" python generate_line_box.py -i "data/ground-truth/marathi1-010.exp0.tif" -t "data/ground-truth/marathi1-010.exp0.gt.txt" > "data/ground-truth/marathi1-010.exp0.box" find data/ground-truth -name '*.box' -exec cat {} \; > "data/all-boxes" unicharset_extractor --output_unicharset "data/unicharset" --norm_mode 1 "data/all-boxes" Extracting unicharset from box file data/all-boxes Wrote unicharset file data/unicharset tesseract data/ground-truth/marathi1-001.exp0.tif data/ground-truth/marathi1-001.exp0 --psm 6 lstm.train Tesseract Open Source OCR Engine v4.0.0 with Leptonica Page 1 tesseract data/ground-truth/marathi1-002.exp0.tif data/ground-truth/marathi1-002.exp0 --psm 6 lstm.train Tesseract Open Source OCR Engine v4.0.0 with Leptonica Page 1 tesseract data/ground-truth/marathi1-003.exp0.tif data/ground-truth/marathi1-003.exp0 --psm 6 lstm.train Tesseract Open Source OCR Engine v4.0.0 with Leptonica Page 1 tesseract data/ground-truth/marathi1-004.exp0.tif data/ground-truth/marathi1-004.exp0 --psm 6 lstm.train Tesseract Open Source OCR Engine v4.0.0 with Leptonica Page 1 tesseract data/ground-truth/marathi1-005.exp0.tif data/ground-truth/marathi1-005.exp0 --psm 6 lstm.train Tesseract Open Source OCR Engine v4.0.0 with Leptonica Page 1 tesseract data/ground-truth/marathi1-006.exp0.tif data/ground-truth/marathi1-006.exp0 --psm 6 lstm.train Tesseract Open Source OCR Engine v4.0.0 with Leptonica Page 1 tesseract data/ground-truth/marathi1-007.exp0.tif data/ground-truth/marathi1-007.exp0 --psm 6 lstm.train Tesseract Open Source OCR Engine v4.0.0 with Leptonica Page 1 tesseract data/ground-truth/marathi1-008.exp0.tif data/ground-truth/marathi1-008.exp0 --psm 6 lstm.train Tesseract Open Source OCR Engine v4.0.0 with Leptonica Page 1 tesseract data/ground-truth/marathi1-009.exp0.tif data/ground-truth/marathi1-009.exp0 --psm 6 lstm.train Tesseract Open Source OCR Engine v4.0.0 with Leptonica Page 1 tesseract data/ground-truth/marathi1-010.exp0.tif data/ground-truth/marathi1-010.exp0 --psm 6 lstm.train Tesseract Open Source OCR Engine v4.0.0 with Leptonica Page 1 find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf" total=cat data/all-lstmf | wc -l \ no=echo "$total * 0.90 / 1" | bc; \ head -n "$no" data/all-lstmf > "data/list.train" total=cat data/all-lstmf | wc -l \ no=echo "($total - $total * 0.90) / 1" | bc; \ tail -n "$no" data/all-lstmf > "data/list.eval" combine_lang_model \ --input_unicharset data/unicharset \ --script_dir data/ \ --output_dir data/ \ --lang foo Loaded unicharset of size 29 from file data/unicharset Setting unichar properties Setting script properties Failed to load script unicharset from:data//Devanagari.unicharset Warning: properties incomplete for index 3 = ग Warning: properties incomplete for index 4 = ण Warning: properties incomplete for index 5 = ज Warning: properties incomplete for index 6 = र Warning: properties incomplete for index 7 = म Warning: properties incomplete for index 8 = न Warning: properties incomplete for index 9 = क Warning: properties incomplete for index 11 = व Warning: properties incomplete for index 12 = उ Warning: properties incomplete for index 13 = ळ Warning: properties incomplete for index 14 = घ Warning: properties incomplete for index 15 = ड Warning: properties incomplete for index 16 = ए Warning: properties incomplete for index 17 = अ Warning: properties incomplete for index 18 = ह Warning: properties incomplete for index 19 = द Warning: properties incomplete for index 20 = ब Warning: properties incomplete for index 21 = ल Warning: properties incomplete for index 23 = प Warning: properties incomplete for index 24 = ट Warning: properties incomplete for index 25 = च Warning: properties incomplete for index 26 = त Warning: properties incomplete for index 27 = ध Warning: properties incomplete for index 28 = फ Config file is optional, continuing... Failed to read data from: data//foo/foo.config Null char=2 mkdir -p data/checkpoints lstmtraining \ --traineddata data/foo/foo.traineddata \ --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1chead -n1 data/unicharset`]"
--model_output data/checkpoints/foo
--learning_rate 20e-4
--train_listfile data/list.train
--eval_listfile data/list.eval
--max_iterations 10000
Warning: given outputs 29 not equal to unicharset of 28.
Num outputs,weights in Series:
1,36,0,1:1, 0
Num outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys48:48, 12480
Lfx96:96, 55680
Lrx96:96, 74112
Lfx256:256, 361472
Fc28:28, 7196
Total weights = 511100
Built network:[1,36,0,1[C3,3Ft16]Mp3,3Lfys48Lfx96Lrx96Lfx256Fc28] from request [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c29]
Training parameters:
Debug interval = 0, weights = 0.1, learning rate = 0.002, momentum=0.5
null char=27
Loaded 16/16 pages (1-16) of document data/ground-truth/marathi1-001.exp0.lstmf
Loaded 14/14 pages (1-14) of document data/ground-truth/marathi1-008.exp0.lstmf
Loaded 14/14 pages (1-14) of document data/ground-truth/marathi1-005.exp0.lstmf
Loaded 21/21 pages (1-21) of document data/ground-truth/marathi1-007.exp0.lstmf
Loaded 21/21 pages (1-21) of document data/ground-truth/marathi1-010.exp0.lstmf
Loaded 25/25 pages (1-25) of document data/ground-truth/marathi1-009.exp0.lstmf
Loaded 24/24 pages (1-24) of document data/ground-truth/marathi1-003.exp0.lstmf
Loaded 26/26 pages (1-26) of document data/ground-truth/marathi1-002.exp0.lstmf
Loaded 21/21 pages (1-21) of document data/ground-truth/marathi1-004.exp0.lstmf
Loaded 25/25 pages (1-25) of document data/ground-truth/marathi1-006.exp0.lstmf
At iteration 99/100/100, Mean rms=7.226%, delta=3.012%, char train=267%, word train=95%, skip ratio=0%, New worst char error = 267 wrote checkpoint.

At iteration 199/200/200, Mean rms=7.06%, delta=2.755%, char train=189.5%, word train=95.5%, skip ratio=0%, New worst char error = 189.5 wrote checkpoint.

At iteration 290/300/300, Mean rms=7.036%, delta=2.443%, char train=164.667%, word train=96.667%, skip ratio=0%, New worst char error = 164.667 wrote checkpoint.

At iteration 387/400/400, Mean rms=7.074%, delta=2.502%, char train=160.5%, word train=96.75%, skip ratio=0%, New worst char error = 160.5 wrote checkpoint.

At iteration 480/500/500, Mean rms=7.07%, delta=2.291%, char train=148.6%, word train=97.4%, skip ratio=0%, New worst char error = 148.6 wrote checkpoint.

At iteration 566/600/600, Mean rms=7.111%, delta=2.129%, char train=140.5%, word train=97.833%, skip ratio=0%, New worst char error = 140.5 wrote checkpoint.

At iteration 644/700/700, Mean rms=7.153%, delta=2.117%, char train=135.786%, word train=98.143%, skip ratio=0%, New worst char error = 135.786 wrote checkpoint.

At iteration 707/800/800, Mean rms=7.186%, delta=2.221%, char train=135.375%, word train=98.375%, skip ratio=0%, New worst char error = 135.375 wrote checkpoint.

At iteration 779/900/900, Mean rms=7.204%, delta=2.397%, char train=140.944%, word train=98.556%, skip ratio=0%, New worst char error = 140.944 wrote checkpoint.

At iteration 863/1000/1000, Mean rms=7.187%, delta=2.838%, char train=154.15%, word train=98.6%, skip ratio=0%, New worst char error = 154.15 wrote checkpoint.

At iteration 956/1100/1100, Mean rms=7.088%, delta=3.46%, char train=147.1%, word train=98.4%, skip ratio=0%, New worst char error = 147.1 wrote checkpoint.

At iteration 1023/1200/1200, Mean rms=7.12%, delta=3.307%, char train=150.85%, word train=98.8%, skip ratio=0%, New worst char error = 150.85 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1110/1300/1300, Mean rms=7.109%, delta=3.743%, char train=161.15%, word train=98.6%, skip ratio=0%, New worst char error = 161.15 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1198/1400/1400, Mean rms=7.1%, delta=4.097%, char train=180.9%, word train=98.9%, skip ratio=0%, New worst char error = 180.9At iteration 1023, stage 0, Eval Char error rate=100, Word error rate=100 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1297/1500/1500, Mean rms=7.025%, delta=4.76%, char train=194.95%, word train=98.8%, skip ratio=0%, New worst char error = 194.95At iteration 1110, stage 0, Eval Char error rate=498.07692, Word error rate=100 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1394/1600/1600, Mean rms=6.898%, delta=5.583%, char train=216.4%, word train=98.8%, skip ratio=0%, New worst char error = 216.4At iteration 1198, stage 0, Eval Char error rate=401.92308, Word error rate=100 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1481/1700/1700, Mean rms=6.778%, delta=6.239%, char train=242.95%, word train=98.8%, skip ratio=0%, New worst char error = 242.95At iteration 1297, stage 0, Eval Char error rate=169.23077, Word error rate=100 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1553/1800/1800, Mean rms=6.624%, delta=6.812%, char train=251.35%, word train=98.1%, skip ratio=0%, New worst char error = 251.35At iteration 1394, stage 0, Eval Char error rate=426.92308, Word error rate=100 wrote checkpoint.

At iteration 1628/1900/1900, Mean rms=6.422%, delta=7.195%, char train=243.8%, word train=96.9%, skip ratio=0%, wrote checkpoint.

At iteration 1715/2000/2000, Mean rms=6.287%, delta=7.481%, char train=231.55%, word train=95.7%, skip ratio=0%, wrote checkpoint.

At iteration 1812/2100/2100, Mean rms=6.187%, delta=7.493%, char train=222.05%, word train=95.3%, skip ratio=0%, wrote checkpoint.

At iteration 1910/2200/2200, Mean rms=5.954%, delta=8.482%, char train=215.4%, word train=93.4%, skip ratio=0%, wrote checkpoint.

At iteration 2010/2300/2300, Mean rms=5.696%, delta=8.712%, char train=203.35%, word train=92.2%, skip ratio=0%, wrote checkpoint.

At iteration 2109/2400/2400, Mean rms=5.341%, delta=8.584%, char train=180.55%, word train=90.7%, skip ratio=0%, wrote checkpoint.

At iteration 2209/2500/2500, Mean rms=4.999%, delta=8.082%, char train=166.15%, word train=89.5%, skip ratio=0%, wrote checkpoint.

At iteration 2309/2600/2600, Mean rms=4.668%, delta=7.392%, char train=144.55%, word train=87.9%, skip ratio=0%, wrote checkpoint.

At iteration 2409/2700/2700, Mean rms=4.319%, delta=6.791%, char train=116.85%, word train=86%, skip ratio=0%, wrote checkpoint.

At iteration 2509/2800/2800, Mean rms=3.999%, delta=6.177%, char train=104.55%, word train=84.6%, skip ratio=0%, wrote checkpoint.

At iteration 2607/2900/2900, Mean rms=3.959%, delta=6.183%, char train=116.2%, word train=85.2%, skip ratio=0%, wrote checkpoint.

At iteration 2706/3000/3000, Mean rms=3.995%, delta=5.89%, char train=121.1%, word train=86.1%, skip ratio=0%, wrote checkpoint.

At iteration 2794/3100/3100, Mean rms=4.082%, delta=5.492%, char train=135%, word train=86.5%, skip ratio=0%, wrote checkpoint.

At iteration 2886/3200/3200, Mean rms=4.173%, delta=5.168%, char train=145.8%, word train=87.8%, skip ratio=0%, wrote checkpoint.

At iteration 2966/3300/3300, Mean rms=4.274%, delta=4.768%, char train=147.05%, word train=87.7%, skip ratio=0%, wrote checkpoint.

At iteration 3013/3400/3400, Mean rms=4.391%, delta=4.429%, char train=142.05%, word train=87.6%, skip ratio=0%, wrote checkpoint.

At iteration 3084/3500/3500, Mean rms=4.628%, delta=4.723%, char train=140.35%, word train=87%, skip ratio=0%, wrote checkpoint.

At iteration 3183/3600/3600, Mean rms=4.839%, delta=5.391%, char train=137.25%, word train=87.1%, skip ratio=0%, wrote checkpoint.

At iteration 3278/3700/3700, Mean rms=4.998%, delta=5.822%, char train=138.65%, word train=87.1%, skip ratio=0%, wrote checkpoint.

At iteration 3376/3800/3800, Mean rms=5.113%, delta=6.118%, char train=138.6%, word train=87.5%, skip ratio=0%, wrote checkpoint.

At iteration 3473/3900/3900, Mean rms=4.971%, delta=5.798%, char train=126.35%, word train=86.3%, skip ratio=0%, wrote checkpoint.

At iteration 3571/4000/4000, Mean rms=4.692%, delta=5.524%, char train=116.6%, word train=85.3%, skip ratio=0%, wrote checkpoint.

At iteration 3659/4100/4100, Mean rms=4.75%, delta=6.248%, char train=119.7%, word train=85.5%, skip ratio=0%, wrote checkpoint.

At iteration 3752/4200/4200, Mean rms=4.531%, delta=5.832%, char train=112.7%, word train=84.5%, skip ratio=0%, wrote checkpoint.

At iteration 3845/4300/4300, Mean rms=4.32%, delta=5.683%, char train=111%, word train=84.5%, skip ratio=0%, wrote checkpoint.

At iteration 3942/4400/4400, Mean rms=4.164%, delta=5.861%, char train=114.7%, word train=84.7%, skip ratio=0%, wrote checkpoint.

At iteration 4039/4500/4500, Mean rms=3.936%, delta=5.578%, char train=115.05%, word train=84.6%, skip ratio=0%, wrote checkpoint.

At iteration 4135/4600/4600, Mean rms=3.752%, delta=4.956%, char train=118.75%, word train=84.9%, skip ratio=0%, wrote checkpoint.

At iteration 4234/4700/4700, Mean rms=3.61%, delta=4.569%, char train=116.45%, word train=84.9%, skip ratio=0%, wrote checkpoint.

At iteration 4334/4800/4800, Mean rms=3.518%, delta=4.338%, char train=116.95%, word train=84.8%, skip ratio=0%, wrote checkpoint.

At iteration 4434/4900/4900, Mean rms=3.452%, delta=4.207%, char train=116.65%, word train=84.7%, skip ratio=0%, wrote checkpoint.

At iteration 4534/5000/5000, Mean rms=3.411%, delta=4.13%, char train=115.6%, word train=84.3%, skip ratio=0%, wrote checkpoint.

2 Percent improvement time=4634, best error was 100 @ 0
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 4634/5100/5100, Mean rms=3.028%, delta=3.178%, char train=99.1%, word train=83.4%, skip ratio=0%, New best char error = 99.1At iteration 1481, stage 0, Eval Char error rate=130.76923, Word error rate=100 wrote checkpoint.

2 Percent improvement time=4734, best error was 100 @ 0
At iteration 4734/5200/5200, Mean rms=2.955%, delta=3.145%, char train=97.15%, word train=83.4%, skip ratio=0%, New best char error = 97.15 wrote checkpoint.

At iteration 4834/5300/5300, Mean rms=2.915%, delta=3.138%, char train=98.65%, word train=83.8%, skip ratio=0%, New worst char error = 98.65 wrote checkpoint.

At iteration 4934/5400/5400, Mean rms=2.877%, delta=3.082%, char train=97.45%, word train=83.1%, skip ratio=0%, New worst char error = 97.45 wrote checkpoint.

At iteration 5034/5500/5500, Mean rms=2.852%, delta=3.031%, char train=99.25%, word train=83.5%, skip ratio=0%, New worst char error = 99.25 wrote checkpoint.

At iteration 5134/5600/5600, Mean rms=2.825%, delta=2.988%, char train=97.9%, word train=83%, skip ratio=0%, New worst char error = 97.9 wrote checkpoint.

At iteration 5234/5700/5700, Mean rms=2.807%, delta=2.946%, char train=100%, word train=83.1%, skip ratio=0%, New worst char error = 100 wrote checkpoint.

At iteration 5334/5800/5800, Mean rms=2.788%, delta=2.886%, char train=99.7%, word train=83.4%, skip ratio=0%, New worst char error = 99.7 wrote checkpoint.

At iteration 5434/5900/5900, Mean rms=2.771%, delta=2.819%, char train=99.85%, word train=83.8%, skip ratio=0%, New worst char error = 99.85 wrote checkpoint.

At iteration 5534/6000/6000, Mean rms=2.76%, delta=2.757%, char train=101.25%, word train=84.2%, skip ratio=0%, New worst char error = 101.25 wrote checkpoint.

At iteration 5634/6100/6100, Mean rms=2.747%, delta=2.709%, char train=101.25%, word train=84.5%, skip ratio=0%, New worst char error = 101.25 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 5734/6200/6200, Mean rms=2.735%, delta=2.655%, char train=101.05%, word train=84.4%, skip ratio=0%, New worst char error = 101.05At iteration 1553, stage 0, Eval Char error rate=267.30769, Word error rate=100 wrote checkpoint.

At iteration 5834/6300/6300, Mean rms=2.726%, delta=2.623%, char train=100.6%, word train=84.3%, skip ratio=0%, wrote checkpoint.

At iteration 5934/6400/6400, Mean rms=2.716%, delta=2.594%, char train=100.95%, word train=84.7%, skip ratio=0%, wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 6034/6500/6500, Mean rms=2.721%, delta=2.591%, char train=101.1%, word train=84.8%, skip ratio=0%, New worst char error = 101.1At iteration 4734, stage 0, Eval Char error rate=151.92308, Word error rate=100 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 6134/6600/6600, Mean rms=2.718%, delta=2.578%, char train=102.95%, word train=84.9%, skip ratio=0%, New worst char error = 102.95At iteration 5734, stage 0, Eval Char error rate=176.92308, Word error rate=100 wrote checkpoint.

At iteration 6234/6700/6700, Mean rms=2.729%, delta=2.585%, char train=102.05%, word train=85%, skip ratio=0%, wrote checkpoint.

At iteration 6334/6800/6800, Mean rms=2.73%, delta=2.581%, char train=102.9%, word train=85.4%, skip ratio=0%, wrote checkpoint.

At iteration 6434/6900/6900, Mean rms=2.743%, delta=2.604%, char train=102.75%, word train=85.2%, skip ratio=0%, wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 6534/7000/7000, Mean rms=2.742%, delta=2.593%, char train=103.15%, word train=85.6%, skip ratio=0%, New worst char error = 103.15At iteration 6034, stage 0, Eval Char error rate=176.92308, Word error rate=100 wrote checkpoint.

At iteration 6634/7100/7100, Mean rms=2.746%, delta=2.596%, char train=102.05%, word train=84.7%, skip ratio=0%, wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 6734/7200/7200, Mean rms=2.747%, delta=2.59%, char train=103.4%, word train=85.3%, skip ratio=0%, New worst char error = 103.4At iteration 6134, stage 0, Eval Char error rate=176.92308, Word error rate=100 wrote checkpoint.

At iteration 6834/7300/7300, Mean rms=2.753%, delta=2.6%, char train=102.8%, word train=84.7%, skip ratio=0%, wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 6934/7400/7400, Mean rms=2.76%, delta=2.614%, char train=103.9%, word train=85.1%, skip ratio=0%, New worst char error = 103.9At iteration 6534, stage 0, Eval Char error rate=176.92308, Word error rate=100 wrote checkpoint.

At iteration 7034/7500/7500, Mean rms=2.76%, delta=2.613%, char train=103.85%, word train=85%, skip ratio=0%, wrote checkpoint.

At iteration 7134/7600/7600, Mean rms=2.771%, delta=2.633%, char train=103.45%, word train=85%, skip ratio=0%, wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 7234/7700/7700, Mean rms=2.77%, delta=2.628%, char train=104.55%, word train=85.5%, skip ratio=0%, New worst char error = 104.55At iteration 6734, stage 0, Eval Char error rate=176.92308, Word error rate=100 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 7334/7800/7800, Mean rms=2.779%, delta=2.647%, char train=104.85%, word train=85.2%, skip ratio=0%, New worst char error = 104.85At iteration 6934, stage 0, Eval Char error rate=165.38462, Word error rate=100 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 7434/7900/7900, Mean rms=2.775%, delta=2.633%, char train=105.7%, word train=85.8%, skip ratio=0%, New worst char error = 105.7At iteration 7234, stage 0, Eval Char error rate=190.38462, Word error rate=100 wrote checkpoint.

At iteration 7534/8000/8000, Mean rms=2.782%, delta=2.656%, char train=105.2%, word train=84.4%, skip ratio=0%, wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 7634/8100/8100, Mean rms=2.781%, delta=2.649%, char train=106.8%, word train=85.1%, skip ratio=0%, New worst char error = 106.8At iteration 7334, stage 0, Eval Char error rate=213.46154, Word error rate=100 wrote checkpoint.

At iteration 7734/8200/8200, Mean rms=2.785%, delta=2.665%, char train=106.35%, word train=84.3%, skip ratio=0%, wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 7834/8300/8300, Mean rms=2.788%, delta=2.664%, char train=108.05%, word train=84.8%, skip ratio=0%, New worst char error = 108.05At iteration 7434, stage 0, Eval Char error rate=144.23077, Word error rate=100 wrote checkpoint.

At iteration 7934/8400/8400, Mean rms=2.794%, delta=2.677%, char train=107.4%, word train=84.1%, skip ratio=0%, wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 8034/8500/8500, Mean rms=2.799%, delta=2.682%, char train=109.45%, word train=84.9%, skip ratio=0%, New worst char error = 109.45At iteration 7634, stage 0, Eval Char error rate=76.923077, Word error rate=76.923077 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 8134/8600/8600, Mean rms=2.796%, delta=2.669%, char train=110.05%, word train=85%, skip ratio=0%, New worst char error = 110.05At iteration 7834, stage 0, Eval Char error rate=128.84615, Word error rate=96.153846 wrote checkpoint.

At iteration 8234/8700/8700, Mean rms=2.792%, delta=2.663%, char train=109.35%, word train=84.5%, skip ratio=0%, wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 8334/8800/8800, Mean rms=2.898%, delta=3.043%, char train=111.75%, word train=84.7%, skip ratio=0%, New worst char error = 111.75At iteration 8034, stage 0, Eval Char error rate=76.923077, Word error rate=76.923077 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 8434/8900/8900, Mean rms=3.051%, delta=3.584%, char train=119.75%, word train=85.3%, skip ratio=0%, New worst char error = 119.75At iteration 8134, stage 0, Eval Char error rate=194.23077, Word error rate=100 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 8533/9000/9000, Mean rms=3.331%, delta=4.42%, char train=127.6%, word train=86.9%, skip ratio=0%, New worst char error = 127.6At iteration 8334, stage 0, Eval Char error rate=150, Word error rate=100 wrote checkpoint.

At iteration 8632/9100/9100, Mean rms=3.371%, delta=4.523%, char train=126.95%, word train=87.5%, skip ratio=0%, wrote checkpoint.

At iteration 8732/9200/9200, Mean rms=3.364%, delta=4.511%, char train=126.25%, word train=87.9%, skip ratio=0%, wrote checkpoint.

At iteration 8832/9300/9300, Mean rms=3.35%, delta=4.496%, char train=124.9%, word train=87.4%, skip ratio=0%, wrote checkpoint.

At iteration 8932/9400/9400, Mean rms=3.35%, delta=4.518%, char train=125.5%, word train=87.9%, skip ratio=0%, wrote checkpoint.

At iteration 9031/9500/9500, Mean rms=3.383%, delta=4.664%, char train=124.6%, word train=87.3%, skip ratio=0%, wrote checkpoint.

At iteration 9130/9600/9600, Mean rms=3.393%, delta=4.727%, char train=124.5%, word train=87.2%, skip ratio=0%, wrote checkpoint.

At iteration 9228/9700/9700, Mean rms=3.401%, delta=4.76%, char train=124.95%, word train=87.4%, skip ratio=0%, wrote checkpoint.

At iteration 9328/9800/9800, Mean rms=3.293%, delta=4.384%, char train=122.7%, word train=86.8%, skip ratio=0%, wrote checkpoint.

At iteration 9428/9900/9900, Mean rms=3.14%, delta=3.847%, char train=113.95%, word train=85.6%, skip ratio=0%, wrote checkpoint.

At iteration 9528/10000/10000, Mean rms=2.847%, delta=2.985%, char train=106.2%, word train=85.3%, skip ratio=0%, wrote checkpoint.

Finished! Error rate = 97.15
`

after performing ocr using newly trained data it gives following output text for sample image

नर

टन

नरण

टन

न र

नट
`

please suggest what is going wrong as to improve ocr accuracy.

The dependency from `proto-model` on `data/unicharset` does not work

data/unicharset is rebuilt everytime I run make proto-model (leading to errors since the underlying all-boxes file has been already deleted)

finetuning

Hi,

What is the status of finetuning? Could you provide an example in README?

I've downloaded eng.traineddata from https://github.com/tesseract-ocr/tessdata_best, put it into data/eng/eng.traineddata directory and set START_MODEL=eng in Makefile.

But this gives me the following error in make training

!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
Makefile:131: recipe for target 'data/checkpoints/foo_checkpoint' failed

make langdata

make: *** No rule to make target 'langdata'. Stop.

Ubuntu 18.04

Adding box file for whole page results a very high Error rate

Hello .

I am facing an issue related to the Error rate I have added box file contains many lines break them by \t but the Error rate is very high and the lstm model is not converging.

Here is a page sample example.

train_sample.zip

Create and integrate a fancy logo tesstrain

E.g. https://www.flickr.com/photos/sdasmarchives/27078655499

Devanagari script box files not being generated

~/ocrd-train$ make unicharset
python generate_line_box.py -i "data/train/devatest-0001-010001.tif" -t "data/train/devatest-0001-010001-gt.txt" > "data/train/devatest-0001-010001.box"
Traceback (most recent call last):
  File "generate_line_box.py", line 39, in <module>
    if not unicodedata.combining(line[-1]):
IndexError: string index out of range
Makefile:92: recipe for target 'data/train/devatest-0001-010001.box' failed
make: *** [data/train/devatest-0001-010001.box] Error 1

Rename

... to something with tesseract

make training errors with sample ground truth

I have extracted ocrd-testset.zip to ./data/ground-truth.

I type the commands:
root@CUDA1:/home/ocrd-train# export PYTHONIOENCODING=utf8
root@CUDA1:/home/ocrd-train# make training

The list.train file is empty. Does anyone know how to fix this?

Box files contain the same box for all characters

Hello guys

thanks for providing this tool. It eases one's life when using images instead of generating synthetic data for specific fonts.
I have been using it and thought that I have got it right! but when I checked the boxes files manually I found out that the box is the same for every character in the line segment and the final line segment is only 2*2 in width and height which is not the case!

Can you please guide me what I have done wrong!

I have only cloned the repo and i had tesseract and leptonica already built and installed and used the make training MODEL_NAME =mine

is this the correct box format to expect ? or am i missing something here?

P.S The Model is training fine and I am getting a final .traineddata and everything looked fine until I have checked this one because I was not very satisfied with the results.
Regards

Normalization Mode

This is not really an issue with ocrd-train and is regarding the following comment in makefile.

Normalization Mode - see src/training/language_specific.sh for details. Default: $(NORM_MODE)

Norm_modes are also specified in https://github.com/tesseract-ocr/tesseract/blob/master/src/training/unicharset_extractor.cpp#L103

There is a conflict between the suggested norm_modes in language_specific.sh and unicharset_extractor.cpp.

I think that unicharset_extractor.cpp is correct and will create a PR in tesseract repo for changing language_specific.sh . See this comment for proposed change.

Meanwhile if you want you can change the comment in makefile.

Failed to load any lstm-specific dictionaries for lang xxx

I met this problem after training a model with OCRD,
in the terminal I input:
tesseract 5.2.tif output --psm 7 -l xxx
and I get this message:

Failed to load any lstm-specific dictionaries for lang tes!!
Tesseract Open Source OCR Engine v4.0.0-beta.4-138-g2093 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.

Anyone can help?

Issue in training with custom image dataset

I have the dataset (images in tif format and transcription in .gt.txt format) and moved to /data/train folder.
Running the training command :

 make training MODEL_NAME=name-of-the-resulting-model

gives me the following error:

make: *** No rule to make target 'JOB#4686', needed by 'data/all-boxes'.  Stop.

When am trying with your sample dataset ocrd-testset training runs without any error.

Running generate_line_box.py with my dataset yields the box values as expected.

Please suggest me what can be done or if its the issue with my dataset?

Attaching my sample dataset.(since tif not supported in github have attached jpeg)

2out_awb8_0_2408_2.gt.txt

Can't encode transcription: '| ঢাকা মেটো-গ |' in language ''

I am trying to train a new language with ben.traineddata . While providing sample training data with
lprBD-7.gt.txt and .tif image, I am getting the error

Can't encode transcription: '| ঢাকা মেটো-গ |' in language '' Encoding of string failed! Failure bytes: ffffffe0 ffffffa6 ffffffbe ffffffe0 ffffffa6 ffffff95 ffffffe0 ffffffa6 ffffffbe 20 ffffffe0 ffffffa6 ffffffae ffffffe0 ffffffa7 ffffff87 ffffffe0 ffffffa6 ffffff9f ffffffe0 ffffffa7 ffffff8d ffffffe0 ffffffa6 ffffffb0 ffffffe0 ffffffa7 ffffff8b 2d 20 ffffffe0 ffffffa6 ffffff97 20 7c 20 ffffffe0 ffffffa5 ffffffa4 Can't encode transcription: '\ ঢাকা মেট্রো- গ | ।' in language '' 2 Percent improvement time=0, best error was 7.2 @ 271 At iteration 271/2700/153861, Mean rms=0.211%, delta=0%, char train=0%, word train=0%, skip ratio=5600%, New best char error = 0 wrote best model:data/checkpoints/BigBenww0_271.checkpoint wrote checkpoint.

What changes should be made in order to train a new language?

more make training errors with sample ground truth

I am having an issue similar to https://github.com/OCR-D/ocrd-train/issues/47 except that I have already extracted ocrd-testset.zip to ./data/ground-truth.

I typed the commands:
root@CUDA1:/home/ocrd-train# export PYTHONIOENCODING=utf8
root@CUDA1:/home/ocrd-train# make training

Output:
tesseract data/ground-truth/alexis_ruhe01_1852_0018_022.tif data/ground-truth/alexis_ruhe01_1852_0018_022 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Failed to read boxes from data/ground-truth/alexis_ruhe01_1852_0018_022.tif
.
.
.
tesseract data/ground-truth/wienbarg_feldzuege_1834_0318_006.tif data/ground-truth/wienbarg_feldzuege_1834_0318_006 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Failed to read boxes from data/ground-truth/wienbarg_feldzuege_1834_0318_006.tif
find data/ground-truth -name '*.lstmf' -exec echo {} ; | sort -R -o "data/all-lstmf"
total=cat data/all-lstmf | wc -l
no=echo "$total * 0.90 / 1" | bc;
head -n "$no" data/all-lstmf > "data/list.train"
total=cat data/all-lstmf | wc -l
no=echo "($total - $total * 0.90) / 1" | bc;
tail -n "$no" data/all-lstmf > "data/list.eval"
wget -Odata/radical-stroke.txt 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt'
--2019-02-27 23:40:21-- https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt
Resolving github.com (github.com)... 192.30.255.112, 192.30.255.113
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt [following]
--2019-02-27 23:40:21-- https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.196.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.196.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 330874 (323K) [text/plain]
Saving to: 'data/radical-stroke.txt'

data/radical-stroke.txt 100%[======================================================================>] 323.12K --.-KB/s in 0.03s

2019-02-27 23:40:22 (11.5 MB/s) - 'data/radical-stroke.txt' saved [330874/330874]

combine_lang_model
--input_unicharset data/unicharset
--script_dir data/
--output_dir data/
--lang foo
Loaded unicharset of size 15 from file data/unicharset
Setting unichar properties
Other case A of a is not in unicharset
Other case N of n is not in unicharset
Other case D of d is not in unicharset
Other case E of e is not in unicharset
Other case R of r is not in unicharset
Other case h of H is not in unicharset
Other case F of f is not in unicharset
Other case c of C is not in unicharset
Other case V of v is not in unicharset
Other case L of l is not in unicharset
Other case I of i is not in unicharset
Setting script properties
Failed to load script unicharset from:data//Latin.unicharset
Warning: properties incomplete for index 3 = a
Warning: properties incomplete for index 4 = n
Warning: properties incomplete for index 5 = d
Warning: properties incomplete for index 6 = e
Warning: properties incomplete for index 7 = r
Warning: properties incomplete for index 8 = H
Warning: properties incomplete for index 9 = f
Warning: properties incomplete for index 10 = C
Warning: properties incomplete for index 11 = v
Warning: properties incomplete for index 12 = l
Warning: properties incomplete for index 13 = i
Warning: properties incomplete for index 14 = .
Config file is optional, continuing...
Failed to read data from: data//foo/foo.config
Null char=2
mkdir -p data/checkpoints
lstmtraining
--traineddata data/foo/foo.traineddata
--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1chead -n1 data/unicharset]"
--model_output data/checkpoints/foo
--learning_rate 20e-4
--train_listfile data/list.train
--eval_listfile data/list.eval
--max_iterations 10000
Warning: given outputs 15 not equal to unicharset of 14.
Num outputs,weights in Series:
1,36,0,1:1, 0
Num outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys48:48, 12480
Lfx96:96, 55680
Lrx96:96, 74112
Lfx256:256, 361472
Fc14:14, 3598
Total weights = 507502
Built network:[1,36,0,1[C3,3Ft16]Mp3,3Lfys48Lfx96Lrx96Lfx256Fc14] from request [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c15]
Training parameters:
Debug interval = 0, weights = 0.1, learning rate = 0.002, momentum=0.5
null char=13
Loaded 1/1 pages (1-1) of document data/ground-truth/alexis_ruhe01_1852_0332_007.lstmf
Failed to load list of eval filenames from data/list.eval
Failed to load eval data from: data/list.eval
Makefile:144: recipe for target 'data/checkpoints/foo_checkpoint' failed
make: *** [data/checkpoints/foo_checkpoint] Error 1

cannot run make leptonica tesseract langdata

Hello dear OCR-D, thank you for making life easier with this repository.

I'm trying to run the make file but I keep getting this error.

_Run_ning aclocal
Running /usr/bin/libtoolize
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'config'.
libtoolize: copying file 'config/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
Running autoheader
Running automake --add-missing --copy
configure.ac:314: installing 'config/compile'
configure.ac:23: installing 'config/missing'
src/api/Makefile.am: installing 'config/depcomp'
Running autoconf
Missing autoconf-archive. Check the build requirements.

Something went wrong, bailing out!

Makefile:162: recipe for target 'tesseract.built' failed
make: *** [tesseract.built] Error 1__

Any ideas?

thank you!

make training problem

Hello!
I'm trying to run the makefile with the test set provided in the directory without success. I created the .box filesm but the script end with this error when i run Make training:

combine_tessdata -u /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/usr/share/tessdata /foo.traineddata /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/usr/share/tessdata /foo.
Failed to read /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/usr/share/tessdata
Makefile:97: recipe for target 'data/unicharset' failed
make: *** [data/unicharset] Error 1

I know it's a problem with my PATH but I don't really understand it. In that folder there are the .traineddata files.

In the OCR group i got this reply:

_You have some problems with your path configuration, check the error message:

Failed to read /home/tulip/Documents/Em/OCR/OCRtraining/ocrd-train/usr/share/tessdata

the path does not make sense. And also the command line:

combine_tessdata -u /home/tulip/Documents/Em/OCR/OCRtraining/ocrd-train/usr/share/tessdata /foo.traineddata  /home/tulip/Documents/Em/OCR/OCRtraining/ocrd-train/usr/share/tessdata /foo.

you probably also have a "blank" after "/usr/share/tessdata".


Bye

Lorenzo_

But I still don't understand why this happens, what do I have to modify in the Makefile to make it work?

Thank you!

`make training` gave me this error

tesseract data/ground-truth/alexis_ruhe01_1852_0087_027.tif data/ground-truth/alexis_ruhe01_1852_0087_027 --psm 6 lstm.train
Error in pixReadMemTiff: function not present
Error in pixReadMem: tiff: no pix returned
Error in pixaGenerateFontFromString: pix not made
Error in bmfCreate: font pixa not made
Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Error in findTiffCompression: function not present
Error in pixReadFromMultipageTiff: function not present

make training problem with example dataset

Hi,

I got a problem when I try to run the make training example shown in the README. I have installed tesseract and leptonica, but when execute the training command, I got the following message:

find data/ground-truth -name '*.box' -exec cat {} \; > "data/all-boxes"
unicharset_extractor --output_unicharset "data/unicharset" --norm_mode 1 "data/all-boxes"
Failed to read data from: data/all-boxes
Wrote unicharset file data/unicharset
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
total=`cat data/all-lstmf | wc -l` \
   no=`echo "$total * 0.90 / 1" | bc`; \
   head -n "$no" data/all-lstmf > "data/list.train"
total=`cat data/all-lstmf | wc -l` \
   no=`echo "($total - $total * 0.90) / 1" | bc`; \
   tail -n "$no" data/all-lstmf > "data/list.eval"
wget -Odata/radical-stroke.txt 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt'
--2018-12-26 17:13:03--  https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt
Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113
Connecting to github.com (github.com)|192.30.253.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt [following]
--2018-12-26 17:13:03--  https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.4.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.4.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 330874 (323K) [text/plain]
Saving to: ‘data/radical-stroke.txt’

data/radical-stroke.txt             100%[=================================================================>] 323,12K   835KB/s    in 0,4s    

2018-12-26 17:13:05 (835 KB/s) - ‘data/radical-stroke.txt’ saved [330874/330874]

combine_lang_model \
  --input_unicharset data/unicharset \
  --script_dir data/ \
  --output_dir data/ \
  --lang test_model
Loaded unicharset of size 3 from file data/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data//Latin.unicharset
Config file is optional, continuing...
Failed to read data from: data//test_model/test_model.config
Null char=2
mkdir -p data/checkpoints
lstmtraining \
  --traineddata data/test_model/test_model.traineddata \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
  --model_output data/checkpoints/test_model \
  --learning_rate 20e-4 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 10000
Failed to load list of training filenames from data/list.train
Makefile:144: recipe for target 'data/checkpoints/test_model_checkpoint' failed
make: *** [data/checkpoints/test_model_checkpoint] Error 1

Thanks.

Make training model failed with illegal line count -- 0

Hi!

I am trying to create a model with the data set given in the repo. I tried below command.

make training MODEL_NAME=name-of-the-resulting-mode
Dependencies are already installed in my mac with help of homebrew, I am getting the following errors.

Unfortunately, when I try this command I get the following errors:

find data/ground-truth -name '*.box' -exec cat {} \; > "data/all-boxes"
unicharset_extractor --output_unicharset "data/unicharset" --norm_mode 1 "data/all-boxes"
Failed to read data from: data/all-boxes
Wrote unicharset file data/unicharset
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
total=`cat data/all-lstmf | wc -l` \
	   no=`echo "$total * 0.90 / 1" | bc`; \
	   head -n "$no" data/all-lstmf > "data/list.train"
head: illegal line count -- 0
make: *** [data/list.train] Error 1

Thanks

Can't encode transcription: 'क्षक्षाक्षिक्षी क्षु क्षू्क्षे क्षैक्षो क्षौ क्षं क्षः' in language ''

while we train for sequence of characters in devanagari script like क्ष (क ् ष) it give us error.
How to trained such joint word/ letters having more than one characters
error.txt

Make training model

Hi!
I am trying to train a model with the test dataset given in the repo.

I am trying the following command after the installation of the dependencies:

 make training MODEL_NAME=name-of-the-resulting-model

Unfortunately when I try this command I get the following errors:

tesseract data/train/image.tif data/train/image --psm 6 lstm.train
Error in pixReadMemTiff: function not present
Error in pixReadMem: tiff: no pix returned
Error in pixaGenerateFontFromString: pix not made
Error in bmfCreate: font pixa not made
Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Error in findTiffCompression: function not present
Error in pixReadFromMultipageTiff: function not present
...
...
Failed to load list of training filenames from data/list.train
Makefile:129 : recipe for target « data/checkpoints/name-of-the-resulting-model_checkpoint » failed
make: *** [data/checkpoints/name-of-the-resulting-model_checkpoint] Error 1

Thanks!

Add exception handling for empty GT files

Right now:

Traceback (most recent call last):
  File "generate_line_box.py", line 41, in <module>
    if not unicodedata.combining(line[-1]):
IndexError: string index out of range
Makefile:111: recipe for target 'data/ground-truth/example.box' failed

Finish implementation of finetuning (i.e. CONTINUE_FROM)

#15 implements only a part of the necessary changes for using CONTINUE_FROM, namely the unicharset merging. Setting the parameters for training as proposed by @Shreeshrii is still missing!

TESSDATA_REPO needs to be tessdata_best

Tesseract model repo to use. Default: $(TESSDATA_REPO)

TESSDATA_REPO = _fast

_fast models are integer models and can NOT be used as base for finetuning. Only models from tessdata_best can be used for this.

Add ocropy/kraken support

Can it be used on Windows?

I have try the command line on Windows， it does not work.

D:\MyGithub\ocrd-train-master>  make leptonica tesseract langdata
wget 'http://www.leptonica.org/source/leptonica-1.76.0.tar.gz'
process_begin: CreateProcess(NULL, wget http://www.leptonica.org/source/leptonica-1.76.0.tar.gz, ...) failed.
make (e=2):
Makefile:141: recipe for target 'leptonica-1.76.0.tar.gz' failed
make: *** [leptonica-1.76.0.tar.gz] Error 2

python generate_line_box.py -i "data/train/alexis_ruhe01_1852_0035_019.tif" -t "data/train/alexis_ruhe01_1852_0035_019.gt.txt" > "data/train/alexis_ruhe01_1852_0035_019.box"
Traceback (most recent call last):
  File "generate_line_box.py", line 26, in <module>
    im = Image.open(file(args.image, "r"))
NameError: name 'file' is not defined
Makefile:91: recipe for target 'data/train/alexis_ruhe01_1852_0035_019.box' failed
make: *** [data/train/alexis_ruhe01_1852_0035_019.box] Error 1

make training fails on load script for unicharset

I get an error when running make training. It seems to insert an extra "/" in the reference to the data directory in one of the steps. Running on Ubuntu 18.04 in a Docker container.

Loaded unicharset of size 79 from file data/unicharset
Setting unichar properties
Other case I of i is not in unicharset
Other case Ä of ä is not in unicharset
Other case Ö of ö is not in unicharset
Other case Uͤ of uͤ is not in unicharset
Other case Aͤ of aͤ is not in unicharset
Other case Oͤ of oͤ is not in unicharset
Other case Y of y is not in unicharset
Setting script properties
Failed to load script unicharset from:data//Latin.unicharset

Add exception handling for GT sets which consist of too few files

Cf. #40

Having fewer than 10 lines results in an empty eval list.

ground-truth for RTL (Urdu)

Hi
it has been clarified else where that box files for RTL languages should be generated like LTR languages. The input data format for OCR-D is line images with corresponding text strings. The example data provided in the readme is straightforward for LTR script. However, is there a difference for RTL languages? Should the text string in .gt.txt be reversed? We are trying to train for Urdu but the final error rate is 90% or above for 444 line pairs (sample attached). We suspect that the direction is the cause. If that is indeed the case, should the text files by reversed at character level?

output.zip

Generated .box files have identical coordinates for every character

Environment:

tesseract 4.0.0
leptonica-1.76.0
libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE

Platfrom:

Darwin Kernel Version 18.2.0 ; RELEASE_X86_64 x86_64

Current Behavior:

Tesseract 4.0 using the best ara.traineddata file is recalling about 85% of the data, which is pretty good. I'm attempting to train Tesseract using Fine Tuning for impact. I used the GitHub project OCR-D Train to generate the .box and .lstmf files required for training, since my training data is composed of text line images. After generating the required .box and .lstmf files, I trained tesseract with a couple of lines to 400 iterations, but the the generated transcription with the fined tuned model looks a lot like
"ل.َ1ح*جُ ح( .َو!ة.اع5 ّة'عآة'ا ن'جة.!ع. ”.َئءؤئجآ| ن!.5ل". I exhausted all the possibilities by training to max_iteration 0 and and a low target_error_rate, but the results were similar.

The transcription generated by the new model can be found below (Fine Tuned.txt):
Fine Tuned.txt

The transcription generated by the original Arabic model can be found below (Arabic Trained Model.txt):
Arabic Trained Model.txt

The fine tuned model can be found below (test1.traineddata):
test1.traineddata.zip

I attempted to train from scratch using 4000 text line images, but they weren't enough to make a difference and didn't seem logical if your trained model is recalling more than 80% of my data.

A sample of my training data which includes the .box and .lstmf is attached below:
training data.zip

'ascii' codec can't encode character u'\u0645' in position 0: ordinal not in range(128)

I am trying to run Urdu training data (using Noori Nastaleeq font) but make training urd results in the following:

python generate_line_box.py -i "data/ground-truth/longJameel_Noori_NastaleeqRegular1610.tif" -t "data/ground-truth/longJameel_Noori_NastaleeqRegular1610.gt.txt" > "data/ground-truth/longJameel_Noori_NastaleeqRegular1610.box"
Traceback (most recent call last):
  File "generate_line_box.py", line 41, in <module>
    print(u"%s %d %d %d %d 0" % (prev_char, 0, 0, width, height))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0645' in position 0: ordinal not in range(128)
Makefile:111: recipe for target 'data/ground-truth/longJameel_Noori_NastaleeqRegular1610.box' failed

I have attached the problematic line image with .gt.txt. The files are generated on Windows uisng GDI and .net and imported to Linux. Putting urd.traineddata beforehand doesn't help as well.
output.zip

Finetuning in ocrd-train

In the Makefile, here is the code for finetuning

ifdef START_MODEL
$(LAST_CHECKPOINT): unicharset lists $(PROTO_MODEL)
mkdir -p data/checkpoints
lstmtraining
--traineddata $(PROTO_MODEL)
--old_traineddata $(TESSDATA)/$(START_MODEL).traineddata
--continue_from data/$(START_MODEL)/$(START_MODEL).lstm
--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1chead -n1 data/unicharset]"
--model_output data/checkpoints/$(MODEL_NAME)
--learning_rate 20e-4
--train_listfile data/list.train
--eval_listfile data/list.eval
--max_iterations 10000

Why do we need the following line? I thought it was only used in training from scratch.
--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1chead -n1 data/unicharset]" \

Should the learning rate be set lower for fine-tuning? The learning rate for training from scratch is 20e-4, so it would seem that the learning rate for fine-tuning should be significantly lower?
--learning_rate 20e-4 \

Remove dependency from langdata repo.

It turned out that only the file radical-stroke.txt is needed for bootstrapping.

unicode error python

I modified the .py file because I kept getting a unicode error:

I just inserted three more lines of code at the beggining of the file

import io
import argparse
import unicodedata
from PIL import Image
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

cheers

Failed to load list of eval filenames from data/list.eval / Segmentation fault

Hi guys,

I’m pretty new to this, so please forgive if I’m missing something obvious.

I’ve posted to the mailing list because Tesseract sometimes confuses the digit 4 with a 9 in the material I’m currently processing. Someone over there pointed me to your project.

So what I would like to do now is to finetune the Latin script to fix the recognition errors I’m seeing. If I understand this correctly, I’ll need to go to data/ground-truth and create files there for Tesseract to learn from, e.g:

April_2014.gt.txt

April 2014

April_2014.tif

Also, I’ve cloned https://github.com/tesseract-ocr/tessdata.

What else do I need to do? The reason I’m asking is because I get an error when executing the following:

$ make -j4 training START_MODEL=Latin TESSDATA=/home/vagrant/tessdata/script
python generate_line_box.py -i "data/ground-truth/April_2012.tif" -t "data/ground-truth/April_2012.gt.txt" > "data/ground-truth/April_2012.box"
python generate_line_box.py -i "data/ground-truth/April_2013.tif" -t "data/ground-truth/April_2013.gt.txt" > "data/ground-truth/April_2013.box"
python generate_line_box.py -i "data/ground-truth/April_2014.tif" -t "data/ground-truth/April_2014.gt.txt" > "data/ground-truth/April_2014.box"
wget -Odata/radical-stroke.txt 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt'
--2018-12-11 00:00:10--  https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt
Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113
Connecting to github.com (github.com)|192.30.253.112|:443... tesseract data/ground-truth/April_2012.tif data/ground-truth/April_2012 --psm 6 lstm.train
tesseract data/ground-truth/April_2014.tif data/ground-truth/April_2014 --psm 6 lstm.train
find data/ground-truth -name '*.box' -exec cat {} \; > "data/all-boxes"
connected.
tesseract data/ground-truth/April_2013.tif data/ground-truth/April_2013 --psm 6 lstm.train
HTTP request sent, awaiting response... Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
mkdir -p data/Latin
combine_tessdata -u /home/vagrant/tessdata/script/Latin.traineddata  data/Latin/Latin
302 Found
Location: https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt [following]
--2018-12-11 00:00:10--  https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
151.101.112.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 330874 (323K) [text/plain]
Saving to: ‘data/radical-stroke.txt’

data/radical-stroke.txt                                        0%[                                                                                                                                               ]       0  --.-KB/s               Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Extracting tessdata components from /home/vagrant/tessdata/script/Latin.traineddata
Wrote data/Latin/Latin.lstm
Wrote data/Latin/Latin.lstm-punc-dawg
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
data/radical-stroke.txt                                      100%[==============================================================================================================================================>] 323.12K  --.-KB/s    in 0.1s

2018-12-11 00:00:10 (2.40 MB/s) - ‘data/radical-stroke.txt’ saved [330874/330874]

total=`cat data/all-lstmf | wc -l` \
   no=`echo "$total * 0.90 / 1" | bc`; \
   head -n "$no" data/all-lstmf > "data/list.train"
total=`cat data/all-lstmf | wc -l` \
   no=`echo "($total - $total * 0.90) / 1" | bc`; \
   tail -n "$no" data/all-lstmf > "data/list.eval"
Wrote data/Latin/Latin.lstm-word-dawg
Wrote data/Latin/Latin.lstm-number-dawg
Wrote data/Latin/Latin.lstm-unicharset
Wrote data/Latin/Latin.lstm-recoder
Wrote data/Latin/Latin.version
Version string:4.00.00alpha:Latin:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=1587099, offset=192
18:lstm-punc-dawg:size=5954, offset=1587291
19:lstm-word-dawg:size=88816882, offset=1593245
20:lstm-number-dawg:size=86050, offset=90410127
21:lstm-unicharset:size=18023, offset=90496177
22:lstm-recoder:size=2735, offset=90514200
23:version:size=82, offset=90516935
unicharset_extractor --output_unicharset "data/ground-truth/my.unicharset" --norm_mode 2 "data/all-boxes"
Extracting unicharset from box file data/all-boxes
Other case a of A is not in unicharset
Other case P of p is not in unicharset
Other case R of r is not in unicharset
Other case I of i is not in unicharset
Other case L of l is not in unicharset
Wrote unicharset file data/ground-truth/my.unicharset
merge_unicharsets data/Latin/Latin.lstm-unicharset data/ground-truth/my.unicharset  "data/unicharset"
Loaded unicharset of size 303 from file data/Latin/Latin.lstm-unicharset
Loaded unicharset of size 13 from file data/ground-truth/my.unicharset
Wrote unicharset file data/unicharset.
combine_lang_model \
  --input_unicharset data/unicharset \
  --script_dir data/ \
  --output_dir data/ \
  --lang foo
Loaded unicharset of size 303 from file data/unicharset
Setting unichar properties
Other case Ẹ̀ of ẹ̀ is not in unicharset
Setting script properties
Failed to load script unicharset from:data//Latin.unicharset
Warning: properties incomplete for index 3 = K
#
# ... truncated ...
#
Warning: properties incomplete for index 302 = ẹ̀
Config file is optional, continuing...
Failed to read data from: data//foo/foo.config
Null char=2
mkdir -p data/checkpoints
lstmtraining \
  --traineddata data/foo/foo.traineddata \
          --old_traineddata /home/vagrant/tessdata/script/Latin.traineddata \
  --continue_from data/Latin/Latin.lstm \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
  --model_output data/checkpoints/foo \
  --learning_rate 20e-4 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 10000
Loaded file data/Latin/Latin.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 302 to 302!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx512:512, 1247232
  Fc302:302, 0
Total weights = 1404064
Previous null char=301 mapped to 301
Continuing from data/Latin/Latin.lstm
Loaded 1/1 pages (1-1) of document data/ground-truth/April_2014.lstmf
Failed to load list of eval filenames from data/list.eval
Failed to load eval data from: data/list.eval
Makefile:131: recipe for target 'data/checkpoints/foo_checkpoint' failed
make: *** [data/checkpoints/foo_checkpoint] Segmentation fault (core dumped)

Looks like it wants a data/list.eval file which isn’t there. Is this why it’s crashing?

I’m running this on Ubuntu 16.04.

Thank you!

Warning: LSTMTrainer deserialized an LSTMRecognizer!

I am training marathi language but i am getting following warning which resulting in training error rate around 100%

Log generated during training

Warning: LSTMTrainer deserialized an LSTMRecognizer!
Loaded 1/1 pages (1-1) of document data/ground-truth/bori-g0351-14-026.exp0.lstmf
At iteration 1189/1200/1200, Mean rms=6.836%, delta=55.863%, char train=111.93%, word train=99.35%, skip ratio=0%, New worst char error = 111.93 wrote checkpoint.

Loaded 1/1 pages (1-1) of document data/ground-truth/bori-g0351-14-014.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1289/1300/1300, Mean rms=6.812%, delta=55.992%, char train=111.945%, word train=98.95%, skip ratio=0%, New worst char error = 111.945At iteration 1090, stage 0, Eval Char error rate=100, Word error rate=100 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1389/1400/1400, Mean rms=6.806%, delta=56.201%, char train=114.316%, word train=98.65%, skip ratio=0%, New worst char error = 114.316At iteration 1189, stage 0, Eval Char error rate=100, Word error rate=100 wrote checkpoint.
.
.
.
.
.
.

Finished! Error rate = 87.055
debug_interval -1
/bin/bash: debug_interval: command not found
make: [Makefile:148: data/checkpoints/mar10_checkpoint] Error 127 (ignored)

[Question] High error rate after training - why?

I'm training with a new set of two fonts and the goal is to use Tesseract to analyze individual chars (only capital letters and numbers), not entire words, but the results are far from being decent and it looks like I'm doing something wrong. Tesseract and Leptonica are installed by the scripts.

Inspired by the test set provided in this repo, I created these tif files with their correct gt.txt's:

From original binarized chars:

From two TTFs to TIF images with random text:

At the end of the data creation process I have 1869 mixed text lines.

First I ran the makefile with 10000 iterations as default, but the best error rate was high.
I thought it was a matter of needing more iterations, so I changed to 30000 but nothing got better. The following image shows the results:

Sometimes I can see the char train increasing instead of getting lower.
What am I missing here? Do I need more training data? Is my initial data not following any important concept?
I'd appreciate any help!

make: *** [data/unicharset] Error

Hi @kba

I have tried the code of tried

combine_tessdata -u /mnt/e/projects/Training_Tesseract/ocrd-train/usr/share/tessdata/foo.traineddata  /mnt/e/projects/Training_Tesseract/ocrd-train/usr/share/tessdata/foo.
Failed to read /mnt/e/projects/Training_Tesseract/ocrd-train/usr/share/tessdata/foo.traineddata
Makefile:98: recipe for target 'data/unicharset' failed

I haven't seen any file called foo.traineddata in the tessdata directory. what do i need to do here

when I change foo.traineddata with eng.traineddata . this command works fine which is expected as combine_tessdata -u extract exiting traineddata
Thanks in advance

generate_line_box should be py 2/3 agnostic

E.g. do not use file constructor function, #9

Page level images

The script works for line level images.

I have a number of scanned page images with ground truth files.

Does OCR-D project have any tools to segment it to line images with corresponding ground truth text?

UnicodeEncodeError: 'ascii' codec can't encode character in Python3

I have seen a closed issue about this problem before. As suggested I switched to python3 but the problem still persists.
Here is the output log

python generate_line_box.py -i "data/train/alexis_ruhe01_1852_0018_022.tif" -t "data/train/alexis_ruhe01_1852_0018_022.gt.txt" > "data/train/alexis_ruhe01_1852_0018_022.box"
Traceback (most recent call last):
  File "generate_line_box.py", line 40, in <module>
    print(u"%s %d %d %d %d 0" % (prev_char, 0, 0, width, height))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017f' in position 0: ordinal not in range(128)
Makefile:110: recipe for target 'data/train/alexis_ruhe01_1852_0018_022.box' failed
make: *** [data/train/alexis_ruhe01_1852_0018_022.box] Error 1

Incorrect train/eval split

I think this line, in the "data/list.eval" block:

tail -n "+$$no" $(ALL_LSTMF) > "$@"

should be:

tail -n "$$no" $(ALL_LSTMF) > "$@"

Add PAGE-XML input functionalities

Ground Truth within OCR-D is mainly represented in PAGE XML. Training based on input files in this format is highly desirable.