tesseract-ocr / tesstrain
Train Tesseract LSTM with make
License: Apache License 2.0
data/train
is misleading since it contains both training and evaluation data.
We trained Tesseract with custom data (2,000 images) for 10k iterations. The resulting model file (digitsmodel.traineddata) is very small (5.1 KB).
When we test the newly trained model, we get the following error:
raise TesseractError(status_code, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, "Error: LSTM requested, but not present!! Loading tesseract. Failed loading language 'digitsmodel' Tesseract couldn't load any languages! Could not initialize tesseract.")
Hi!
Please add license info.
That's the first thing I look for in any project that looks interesting.
OS:
Ubuntu 18.04
What I typed in Terminal:
make training
What I received:
python generate_line_box.py -i "data/ground-truth/andreas_fenitschka_1898_0085_025.tif" -t "data/ground-truth/andreas_fenitschka_1898_0085_025.gt.txt" > "data/ground-truth/andreas_fenitschka_1898_0085_025.box"
Traceback (most recent call last):
File "generate_line_box.py", line 41, in
print(u"%s %d %d %d %d 0" % (prev_char, 0, 0, width, height))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017f' in position 0: ordinal not in range(128)
Makefile:111: recipe for target 'data/ground-truth/andreas_fenitschka_1898_0085_025.box' failed
make: *** [data/ground-truth/andreas_fenitschka_1898_0085_025.box] Error 1
I use a modified version of the Makefile for fine-tuning. Please add the relevant functionality from the file below to the current Makefile, or add new variants for fine-tuning and for replacing a layer.
export
SHELL := /bin/bash
LOCAL := $(PWD)/usr
PATH := $(LOCAL)/bin:$(PATH)
HOME := /home/ubuntu
TESSDATA = $(HOME)/tessdata_best
LANGDATA = $(HOME)/langdata
# Name of the model to be built
MODEL_NAME = frk
# Name of the model to continue from
CONTINUE_FROM = frk
# Normalization Mode - see src/training/language_specific.sh for details
NORM_MODE = 2
# Tesseract model repo to use. Default: $(TESSDATA_REPO)
TESSDATA_REPO = _best
# Train directory
TRAIN := data/train
# BEGIN-EVAL makefile-parser --make-help Makefile
help:
	@echo ""
	@echo "  Targets"
	@echo ""
	@echo "    unicharset       Create unicharset"
	@echo "    lists            Create lists of lstmf filenames for training and eval"
	@echo "    training         Start training"
	@echo "    proto-model      Build the proto model"
	@echo "    leptonica        Build leptonica"
	@echo "    tesseract        Build tesseract"
	@echo "    tesseract-langs  Download tesseract-langs"
	@echo "    langdata         Download langdata"
	@echo "    clean            Clean all generated files"
	@echo ""
	@echo "  Variables"
	@echo ""
	@echo "    MODEL_NAME         Name of the model to be built"
	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
	@echo "    TRAIN              Train directory"
	@echo "    RATIO_TRAIN        Ratio of train / eval training data"
# END-EVAL
# Ratio of train / eval training data
RATIO_TRAIN := 0.90
ALL_BOXES = data/all-boxes
ALL_LSTMF = data/all-lstmf
# Create unicharset
unicharset: data/unicharset
# Create lists of lstmf filenames for training and eval
lists: $(ALL_LSTMF) data/list.train data/list.eval
data/list.train: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l`; \
	no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
	head -n "$$no" $(ALL_LSTMF) > "$@"
data/list.eval: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l`; \
	no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
	tail -n "$$no" $(ALL_LSTMF) > "$@"
# Start training
training: data/$(MODEL_NAME).traineddata
data/unicharset: $(ALL_BOXES)
	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@"
$(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"
$(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%-gt.txt
	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*-gt.txt" > "$@"
$(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"
$(TRAIN)/%.lstmf: $(TRAIN)/%.box
	tesseract $(TRAIN)/$*.tif $(TRAIN)/$* --psm 6 lstm.train
# Build the proto model
proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata
data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
	combine_lang_model \
	  --input_unicharset data/unicharset \
	  --script_dir $(LANGDATA) \
	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
	  --output_dir data/ \
	  --lang $(MODEL_NAME)
data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
	mkdir -p data/checkpoints
	lstmtraining \
	  --continue_from $(TESSDATA)/$(CONTINUE_FROM).lstm \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output data/checkpoints/$(MODEL_NAME) \
	  --debug_interval -1 \
	  --train_listfile data/list.train \
	  --eval_listfile data/list.eval \
	  --sequential_training \
	  --max_iterations 3000
data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
	lstmtraining \
	  --stop_training \
	  --continue_from $^ \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output $@
# Clean all generated files
clean:
	find data/train -name '*.box' -delete
	find data/train -name '*.lstmf' -delete
	rm -rf data/all-*
	rm -rf data/list.*
	rm -rf data/$(MODEL_NAME)
	rm -rf data/unicharset
	rm -rf data/checkpoints
We are training a model from scratch using ocrd-train for Devanagari script, from the following sample images (10 line samples).
Training log after make training:
python generate_line_box.py -i "data/ground-truth/marathi1-001.exp0.tif" -t "data/ground-truth/marathi1-001.exp0.gt.txt" > "data/ground-truth/marathi1-001.exp0.box"
python generate_line_box.py -i "data/ground-truth/marathi1-002.exp0.tif" -t "data/ground-truth/marathi1-002.exp0.gt.txt" > "data/ground-truth/marathi1-002.exp0.box"
python generate_line_box.py -i "data/ground-truth/marathi1-003.exp0.tif" -t "data/ground-truth/marathi1-003.exp0.gt.txt" > "data/ground-truth/marathi1-003.exp0.box"
python generate_line_box.py -i "data/ground-truth/marathi1-004.exp0.tif" -t "data/ground-truth/marathi1-004.exp0.gt.txt" > "data/ground-truth/marathi1-004.exp0.box"
python generate_line_box.py -i "data/ground-truth/marathi1-005.exp0.tif" -t "data/ground-truth/marathi1-005.exp0.gt.txt" > "data/ground-truth/marathi1-005.exp0.box"
python generate_line_box.py -i "data/ground-truth/marathi1-006.exp0.tif" -t "data/ground-truth/marathi1-006.exp0.gt.txt" > "data/ground-truth/marathi1-006.exp0.box"
python generate_line_box.py -i "data/ground-truth/marathi1-007.exp0.tif" -t "data/ground-truth/marathi1-007.exp0.gt.txt" > "data/ground-truth/marathi1-007.exp0.box"
python generate_line_box.py -i "data/ground-truth/marathi1-008.exp0.tif" -t "data/ground-truth/marathi1-008.exp0.gt.txt" > "data/ground-truth/marathi1-008.exp0.box"
python generate_line_box.py -i "data/ground-truth/marathi1-009.exp0.tif" -t "data/ground-truth/marathi1-009.exp0.gt.txt" > "data/ground-truth/marathi1-009.exp0.box"
python generate_line_box.py -i "data/ground-truth/marathi1-010.exp0.tif" -t "data/ground-truth/marathi1-010.exp0.gt.txt" > "data/ground-truth/marathi1-010.exp0.box"
find data/ground-truth -name '*.box' -exec cat {} \; > "data/all-boxes"
unicharset_extractor --output_unicharset "data/unicharset" --norm_mode 1 "data/all-boxes"
Extracting unicharset from box file data/all-boxes
Wrote unicharset file data/unicharset
tesseract data/ground-truth/marathi1-001.exp0.tif data/ground-truth/marathi1-001.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 1
tesseract data/ground-truth/marathi1-002.exp0.tif data/ground-truth/marathi1-002.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 1
tesseract data/ground-truth/marathi1-003.exp0.tif data/ground-truth/marathi1-003.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 1
tesseract data/ground-truth/marathi1-004.exp0.tif data/ground-truth/marathi1-004.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 1
tesseract data/ground-truth/marathi1-005.exp0.tif data/ground-truth/marathi1-005.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 1
tesseract data/ground-truth/marathi1-006.exp0.tif data/ground-truth/marathi1-006.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 1
tesseract data/ground-truth/marathi1-007.exp0.tif data/ground-truth/marathi1-007.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 1
tesseract data/ground-truth/marathi1-008.exp0.tif data/ground-truth/marathi1-008.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 1
tesseract data/ground-truth/marathi1-009.exp0.tif data/ground-truth/marathi1-009.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 1
tesseract data/ground-truth/marathi1-010.exp0.tif data/ground-truth/marathi1-010.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Page 1
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
total=`cat data/all-lstmf | wc -l`; \
no=`echo "$total * 0.90 / 1" | bc`; \
head -n "$no" data/all-lstmf > "data/list.train"
total=`cat data/all-lstmf | wc -l`; \
no=`echo "($total - $total * 0.90) / 1" | bc`; \
tail -n "$no" data/all-lstmf > "data/list.eval"
combine_lang_model \
  --input_unicharset data/unicharset \
  --script_dir data/ \
  --output_dir data/ \
  --lang foo
Loaded unicharset of size 29 from file data/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data//Devanagari.unicharset
Warning: properties incomplete for index 3 = ग
Warning: properties incomplete for index 4 = ण
Warning: properties incomplete for index 5 = ज
Warning: properties incomplete for index 6 = र
Warning: properties incomplete for index 7 = म
Warning: properties incomplete for index 8 = न
Warning: properties incomplete for index 9 = क
Warning: properties incomplete for index 11 = व
Warning: properties incomplete for index 12 = उ
Warning: properties incomplete for index 13 = ळ
Warning: properties incomplete for index 14 = घ
Warning: properties incomplete for index 15 = ड
Warning: properties incomplete for index 16 = ए
Warning: properties incomplete for index 17 = अ
Warning: properties incomplete for index 18 = ह
Warning: properties incomplete for index 19 = द
Warning: properties incomplete for index 20 = ब
Warning: properties incomplete for index 21 = ल
Warning: properties incomplete for index 23 = प
Warning: properties incomplete for index 24 = ट
Warning: properties incomplete for index 25 = च
Warning: properties incomplete for index 26 = त
Warning: properties incomplete for index 27 = ध
Warning: properties incomplete for index 28 = फ
Config file is optional, continuing...
Failed to read data from: data//foo/foo.config
Null char=2
mkdir -p data/checkpoints
lstmtraining \
  --traineddata data/foo/foo.traineddata \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
  --model_output data/checkpoints/foo \
  --learning_rate 20e-4 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 10000
Warning: given outputs 29 not equal to unicharset of 28.
Num outputs,weights in Series:
1,36,0,1:1, 0
Num outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys48:48, 12480
Lfx96:96, 55680
Lrx96:96, 74112
Lfx256:256, 361472
Fc28:28, 7196
Total weights = 511100
Built network:[1,36,0,1[C3,3Ft16]Mp3,3Lfys48Lfx96Lrx96Lfx256Fc28] from request [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c29]
Training parameters:
Debug interval = 0, weights = 0.1, learning rate = 0.002, momentum=0.5
null char=27
Loaded 16/16 pages (1-16) of document data/ground-truth/marathi1-001.exp0.lstmf
Loaded 14/14 pages (1-14) of document data/ground-truth/marathi1-008.exp0.lstmf
Loaded 14/14 pages (1-14) of document data/ground-truth/marathi1-005.exp0.lstmf
Loaded 21/21 pages (1-21) of document data/ground-truth/marathi1-007.exp0.lstmf
Loaded 21/21 pages (1-21) of document data/ground-truth/marathi1-010.exp0.lstmf
Loaded 25/25 pages (1-25) of document data/ground-truth/marathi1-009.exp0.lstmf
Loaded 24/24 pages (1-24) of document data/ground-truth/marathi1-003.exp0.lstmf
Loaded 26/26 pages (1-26) of document data/ground-truth/marathi1-002.exp0.lstmf
Loaded 21/21 pages (1-21) of document data/ground-truth/marathi1-004.exp0.lstmf
Loaded 25/25 pages (1-25) of document data/ground-truth/marathi1-006.exp0.lstmf
At iteration 99/100/100, Mean rms=7.226%, delta=3.012%, char train=267%, word train=95%, skip ratio=0%, New worst char error = 267 wrote checkpoint.
At iteration 199/200/200, Mean rms=7.06%, delta=2.755%, char train=189.5%, word train=95.5%, skip ratio=0%, New worst char error = 189.5 wrote checkpoint.
At iteration 290/300/300, Mean rms=7.036%, delta=2.443%, char train=164.667%, word train=96.667%, skip ratio=0%, New worst char error = 164.667 wrote checkpoint.
At iteration 387/400/400, Mean rms=7.074%, delta=2.502%, char train=160.5%, word train=96.75%, skip ratio=0%, New worst char error = 160.5 wrote checkpoint.
At iteration 480/500/500, Mean rms=7.07%, delta=2.291%, char train=148.6%, word train=97.4%, skip ratio=0%, New worst char error = 148.6 wrote checkpoint.
At iteration 566/600/600, Mean rms=7.111%, delta=2.129%, char train=140.5%, word train=97.833%, skip ratio=0%, New worst char error = 140.5 wrote checkpoint.
At iteration 644/700/700, Mean rms=7.153%, delta=2.117%, char train=135.786%, word train=98.143%, skip ratio=0%, New worst char error = 135.786 wrote checkpoint.
At iteration 707/800/800, Mean rms=7.186%, delta=2.221%, char train=135.375%, word train=98.375%, skip ratio=0%, New worst char error = 135.375 wrote checkpoint.
At iteration 779/900/900, Mean rms=7.204%, delta=2.397%, char train=140.944%, word train=98.556%, skip ratio=0%, New worst char error = 140.944 wrote checkpoint.
At iteration 863/1000/1000, Mean rms=7.187%, delta=2.838%, char train=154.15%, word train=98.6%, skip ratio=0%, New worst char error = 154.15 wrote checkpoint.
At iteration 956/1100/1100, Mean rms=7.088%, delta=3.46%, char train=147.1%, word train=98.4%, skip ratio=0%, New worst char error = 147.1 wrote checkpoint.
At iteration 1023/1200/1200, Mean rms=7.12%, delta=3.307%, char train=150.85%, word train=98.8%, skip ratio=0%, New worst char error = 150.85 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1110/1300/1300, Mean rms=7.109%, delta=3.743%, char train=161.15%, word train=98.6%, skip ratio=0%, New worst char error = 161.15 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1198/1400/1400, Mean rms=7.1%, delta=4.097%, char train=180.9%, word train=98.9%, skip ratio=0%, New worst char error = 180.9At iteration 1023, stage 0, Eval Char error rate=100, Word error rate=100 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1297/1500/1500, Mean rms=7.025%, delta=4.76%, char train=194.95%, word train=98.8%, skip ratio=0%, New worst char error = 194.95At iteration 1110, stage 0, Eval Char error rate=498.07692, Word error rate=100 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1394/1600/1600, Mean rms=6.898%, delta=5.583%, char train=216.4%, word train=98.8%, skip ratio=0%, New worst char error = 216.4At iteration 1198, stage 0, Eval Char error rate=401.92308, Word error rate=100 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1481/1700/1700, Mean rms=6.778%, delta=6.239%, char train=242.95%, word train=98.8%, skip ratio=0%, New worst char error = 242.95At iteration 1297, stage 0, Eval Char error rate=169.23077, Word error rate=100 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1553/1800/1800, Mean rms=6.624%, delta=6.812%, char train=251.35%, word train=98.1%, skip ratio=0%, New worst char error = 251.35At iteration 1394, stage 0, Eval Char error rate=426.92308, Word error rate=100 wrote checkpoint.
At iteration 1628/1900/1900, Mean rms=6.422%, delta=7.195%, char train=243.8%, word train=96.9%, skip ratio=0%, wrote checkpoint.
At iteration 1715/2000/2000, Mean rms=6.287%, delta=7.481%, char train=231.55%, word train=95.7%, skip ratio=0%, wrote checkpoint.
At iteration 1812/2100/2100, Mean rms=6.187%, delta=7.493%, char train=222.05%, word train=95.3%, skip ratio=0%, wrote checkpoint.
At iteration 1910/2200/2200, Mean rms=5.954%, delta=8.482%, char train=215.4%, word train=93.4%, skip ratio=0%, wrote checkpoint.
At iteration 2010/2300/2300, Mean rms=5.696%, delta=8.712%, char train=203.35%, word train=92.2%, skip ratio=0%, wrote checkpoint.
At iteration 2109/2400/2400, Mean rms=5.341%, delta=8.584%, char train=180.55%, word train=90.7%, skip ratio=0%, wrote checkpoint.
At iteration 2209/2500/2500, Mean rms=4.999%, delta=8.082%, char train=166.15%, word train=89.5%, skip ratio=0%, wrote checkpoint.
At iteration 2309/2600/2600, Mean rms=4.668%, delta=7.392%, char train=144.55%, word train=87.9%, skip ratio=0%, wrote checkpoint.
At iteration 2409/2700/2700, Mean rms=4.319%, delta=6.791%, char train=116.85%, word train=86%, skip ratio=0%, wrote checkpoint.
At iteration 2509/2800/2800, Mean rms=3.999%, delta=6.177%, char train=104.55%, word train=84.6%, skip ratio=0%, wrote checkpoint.
At iteration 2607/2900/2900, Mean rms=3.959%, delta=6.183%, char train=116.2%, word train=85.2%, skip ratio=0%, wrote checkpoint.
At iteration 2706/3000/3000, Mean rms=3.995%, delta=5.89%, char train=121.1%, word train=86.1%, skip ratio=0%, wrote checkpoint.
At iteration 2794/3100/3100, Mean rms=4.082%, delta=5.492%, char train=135%, word train=86.5%, skip ratio=0%, wrote checkpoint.
At iteration 2886/3200/3200, Mean rms=4.173%, delta=5.168%, char train=145.8%, word train=87.8%, skip ratio=0%, wrote checkpoint.
At iteration 2966/3300/3300, Mean rms=4.274%, delta=4.768%, char train=147.05%, word train=87.7%, skip ratio=0%, wrote checkpoint.
At iteration 3013/3400/3400, Mean rms=4.391%, delta=4.429%, char train=142.05%, word train=87.6%, skip ratio=0%, wrote checkpoint.
At iteration 3084/3500/3500, Mean rms=4.628%, delta=4.723%, char train=140.35%, word train=87%, skip ratio=0%, wrote checkpoint.
At iteration 3183/3600/3600, Mean rms=4.839%, delta=5.391%, char train=137.25%, word train=87.1%, skip ratio=0%, wrote checkpoint.
At iteration 3278/3700/3700, Mean rms=4.998%, delta=5.822%, char train=138.65%, word train=87.1%, skip ratio=0%, wrote checkpoint.
At iteration 3376/3800/3800, Mean rms=5.113%, delta=6.118%, char train=138.6%, word train=87.5%, skip ratio=0%, wrote checkpoint.
At iteration 3473/3900/3900, Mean rms=4.971%, delta=5.798%, char train=126.35%, word train=86.3%, skip ratio=0%, wrote checkpoint.
At iteration 3571/4000/4000, Mean rms=4.692%, delta=5.524%, char train=116.6%, word train=85.3%, skip ratio=0%, wrote checkpoint.
At iteration 3659/4100/4100, Mean rms=4.75%, delta=6.248%, char train=119.7%, word train=85.5%, skip ratio=0%, wrote checkpoint.
At iteration 3752/4200/4200, Mean rms=4.531%, delta=5.832%, char train=112.7%, word train=84.5%, skip ratio=0%, wrote checkpoint.
At iteration 3845/4300/4300, Mean rms=4.32%, delta=5.683%, char train=111%, word train=84.5%, skip ratio=0%, wrote checkpoint.
At iteration 3942/4400/4400, Mean rms=4.164%, delta=5.861%, char train=114.7%, word train=84.7%, skip ratio=0%, wrote checkpoint.
At iteration 4039/4500/4500, Mean rms=3.936%, delta=5.578%, char train=115.05%, word train=84.6%, skip ratio=0%, wrote checkpoint.
At iteration 4135/4600/4600, Mean rms=3.752%, delta=4.956%, char train=118.75%, word train=84.9%, skip ratio=0%, wrote checkpoint.
At iteration 4234/4700/4700, Mean rms=3.61%, delta=4.569%, char train=116.45%, word train=84.9%, skip ratio=0%, wrote checkpoint.
At iteration 4334/4800/4800, Mean rms=3.518%, delta=4.338%, char train=116.95%, word train=84.8%, skip ratio=0%, wrote checkpoint.
At iteration 4434/4900/4900, Mean rms=3.452%, delta=4.207%, char train=116.65%, word train=84.7%, skip ratio=0%, wrote checkpoint.
At iteration 4534/5000/5000, Mean rms=3.411%, delta=4.13%, char train=115.6%, word train=84.3%, skip ratio=0%, wrote checkpoint.
2 Percent improvement time=4634, best error was 100 @ 0
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 4634/5100/5100, Mean rms=3.028%, delta=3.178%, char train=99.1%, word train=83.4%, skip ratio=0%, New best char error = 99.1At iteration 1481, stage 0, Eval Char error rate=130.76923, Word error rate=100 wrote checkpoint.
2 Percent improvement time=4734, best error was 100 @ 0
At iteration 4734/5200/5200, Mean rms=2.955%, delta=3.145%, char train=97.15%, word train=83.4%, skip ratio=0%, New best char error = 97.15 wrote checkpoint.
At iteration 4834/5300/5300, Mean rms=2.915%, delta=3.138%, char train=98.65%, word train=83.8%, skip ratio=0%, New worst char error = 98.65 wrote checkpoint.
At iteration 4934/5400/5400, Mean rms=2.877%, delta=3.082%, char train=97.45%, word train=83.1%, skip ratio=0%, New worst char error = 97.45 wrote checkpoint.
At iteration 5034/5500/5500, Mean rms=2.852%, delta=3.031%, char train=99.25%, word train=83.5%, skip ratio=0%, New worst char error = 99.25 wrote checkpoint.
At iteration 5134/5600/5600, Mean rms=2.825%, delta=2.988%, char train=97.9%, word train=83%, skip ratio=0%, New worst char error = 97.9 wrote checkpoint.
At iteration 5234/5700/5700, Mean rms=2.807%, delta=2.946%, char train=100%, word train=83.1%, skip ratio=0%, New worst char error = 100 wrote checkpoint.
At iteration 5334/5800/5800, Mean rms=2.788%, delta=2.886%, char train=99.7%, word train=83.4%, skip ratio=0%, New worst char error = 99.7 wrote checkpoint.
At iteration 5434/5900/5900, Mean rms=2.771%, delta=2.819%, char train=99.85%, word train=83.8%, skip ratio=0%, New worst char error = 99.85 wrote checkpoint.
At iteration 5534/6000/6000, Mean rms=2.76%, delta=2.757%, char train=101.25%, word train=84.2%, skip ratio=0%, New worst char error = 101.25 wrote checkpoint.
At iteration 5634/6100/6100, Mean rms=2.747%, delta=2.709%, char train=101.25%, word train=84.5%, skip ratio=0%, New worst char error = 101.25 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 5734/6200/6200, Mean rms=2.735%, delta=2.655%, char train=101.05%, word train=84.4%, skip ratio=0%, New worst char error = 101.05At iteration 1553, stage 0, Eval Char error rate=267.30769, Word error rate=100 wrote checkpoint.
At iteration 5834/6300/6300, Mean rms=2.726%, delta=2.623%, char train=100.6%, word train=84.3%, skip ratio=0%, wrote checkpoint.
At iteration 5934/6400/6400, Mean rms=2.716%, delta=2.594%, char train=100.95%, word train=84.7%, skip ratio=0%, wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 6034/6500/6500, Mean rms=2.721%, delta=2.591%, char train=101.1%, word train=84.8%, skip ratio=0%, New worst char error = 101.1At iteration 4734, stage 0, Eval Char error rate=151.92308, Word error rate=100 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 6134/6600/6600, Mean rms=2.718%, delta=2.578%, char train=102.95%, word train=84.9%, skip ratio=0%, New worst char error = 102.95At iteration 5734, stage 0, Eval Char error rate=176.92308, Word error rate=100 wrote checkpoint.
At iteration 6234/6700/6700, Mean rms=2.729%, delta=2.585%, char train=102.05%, word train=85%, skip ratio=0%, wrote checkpoint.
At iteration 6334/6800/6800, Mean rms=2.73%, delta=2.581%, char train=102.9%, word train=85.4%, skip ratio=0%, wrote checkpoint.
At iteration 6434/6900/6900, Mean rms=2.743%, delta=2.604%, char train=102.75%, word train=85.2%, skip ratio=0%, wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 6534/7000/7000, Mean rms=2.742%, delta=2.593%, char train=103.15%, word train=85.6%, skip ratio=0%, New worst char error = 103.15At iteration 6034, stage 0, Eval Char error rate=176.92308, Word error rate=100 wrote checkpoint.
At iteration 6634/7100/7100, Mean rms=2.746%, delta=2.596%, char train=102.05%, word train=84.7%, skip ratio=0%, wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 6734/7200/7200, Mean rms=2.747%, delta=2.59%, char train=103.4%, word train=85.3%, skip ratio=0%, New worst char error = 103.4At iteration 6134, stage 0, Eval Char error rate=176.92308, Word error rate=100 wrote checkpoint.
At iteration 6834/7300/7300, Mean rms=2.753%, delta=2.6%, char train=102.8%, word train=84.7%, skip ratio=0%, wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 6934/7400/7400, Mean rms=2.76%, delta=2.614%, char train=103.9%, word train=85.1%, skip ratio=0%, New worst char error = 103.9At iteration 6534, stage 0, Eval Char error rate=176.92308, Word error rate=100 wrote checkpoint.
At iteration 7034/7500/7500, Mean rms=2.76%, delta=2.613%, char train=103.85%, word train=85%, skip ratio=0%, wrote checkpoint.
At iteration 7134/7600/7600, Mean rms=2.771%, delta=2.633%, char train=103.45%, word train=85%, skip ratio=0%, wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 7234/7700/7700, Mean rms=2.77%, delta=2.628%, char train=104.55%, word train=85.5%, skip ratio=0%, New worst char error = 104.55At iteration 6734, stage 0, Eval Char error rate=176.92308, Word error rate=100 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 7334/7800/7800, Mean rms=2.779%, delta=2.647%, char train=104.85%, word train=85.2%, skip ratio=0%, New worst char error = 104.85At iteration 6934, stage 0, Eval Char error rate=165.38462, Word error rate=100 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 7434/7900/7900, Mean rms=2.775%, delta=2.633%, char train=105.7%, word train=85.8%, skip ratio=0%, New worst char error = 105.7At iteration 7234, stage 0, Eval Char error rate=190.38462, Word error rate=100 wrote checkpoint.
At iteration 7534/8000/8000, Mean rms=2.782%, delta=2.656%, char train=105.2%, word train=84.4%, skip ratio=0%, wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 7634/8100/8100, Mean rms=2.781%, delta=2.649%, char train=106.8%, word train=85.1%, skip ratio=0%, New worst char error = 106.8At iteration 7334, stage 0, Eval Char error rate=213.46154, Word error rate=100 wrote checkpoint.
At iteration 7734/8200/8200, Mean rms=2.785%, delta=2.665%, char train=106.35%, word train=84.3%, skip ratio=0%, wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 7834/8300/8300, Mean rms=2.788%, delta=2.664%, char train=108.05%, word train=84.8%, skip ratio=0%, New worst char error = 108.05At iteration 7434, stage 0, Eval Char error rate=144.23077, Word error rate=100 wrote checkpoint.
At iteration 7934/8400/8400, Mean rms=2.794%, delta=2.677%, char train=107.4%, word train=84.1%, skip ratio=0%, wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 8034/8500/8500, Mean rms=2.799%, delta=2.682%, char train=109.45%, word train=84.9%, skip ratio=0%, New worst char error = 109.45At iteration 7634, stage 0, Eval Char error rate=76.923077, Word error rate=76.923077 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 8134/8600/8600, Mean rms=2.796%, delta=2.669%, char train=110.05%, word train=85%, skip ratio=0%, New worst char error = 110.05At iteration 7834, stage 0, Eval Char error rate=128.84615, Word error rate=96.153846 wrote checkpoint.
At iteration 8234/8700/8700, Mean rms=2.792%, delta=2.663%, char train=109.35%, word train=84.5%, skip ratio=0%, wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 8334/8800/8800, Mean rms=2.898%, delta=3.043%, char train=111.75%, word train=84.7%, skip ratio=0%, New worst char error = 111.75At iteration 8034, stage 0, Eval Char error rate=76.923077, Word error rate=76.923077 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 8434/8900/8900, Mean rms=3.051%, delta=3.584%, char train=119.75%, word train=85.3%, skip ratio=0%, New worst char error = 119.75At iteration 8134, stage 0, Eval Char error rate=194.23077, Word error rate=100 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 8533/9000/9000, Mean rms=3.331%, delta=4.42%, char train=127.6%, word train=86.9%, skip ratio=0%, New worst char error = 127.6At iteration 8334, stage 0, Eval Char error rate=150, Word error rate=100 wrote checkpoint.
At iteration 8632/9100/9100, Mean rms=3.371%, delta=4.523%, char train=126.95%, word train=87.5%, skip ratio=0%, wrote checkpoint.
At iteration 8732/9200/9200, Mean rms=3.364%, delta=4.511%, char train=126.25%, word train=87.9%, skip ratio=0%, wrote checkpoint.
At iteration 8832/9300/9300, Mean rms=3.35%, delta=4.496%, char train=124.9%, word train=87.4%, skip ratio=0%, wrote checkpoint.
At iteration 8932/9400/9400, Mean rms=3.35%, delta=4.518%, char train=125.5%, word train=87.9%, skip ratio=0%, wrote checkpoint.
At iteration 9031/9500/9500, Mean rms=3.383%, delta=4.664%, char train=124.6%, word train=87.3%, skip ratio=0%, wrote checkpoint.
At iteration 9130/9600/9600, Mean rms=3.393%, delta=4.727%, char train=124.5%, word train=87.2%, skip ratio=0%, wrote checkpoint.
At iteration 9228/9700/9700, Mean rms=3.401%, delta=4.76%, char train=124.95%, word train=87.4%, skip ratio=0%, wrote checkpoint.
At iteration 9328/9800/9800, Mean rms=3.293%, delta=4.384%, char train=122.7%, word train=86.8%, skip ratio=0%, wrote checkpoint.
At iteration 9428/9900/9900, Mean rms=3.14%, delta=3.847%, char train=113.95%, word train=85.6%, skip ratio=0%, wrote checkpoint.
At iteration 9528/10000/10000, Mean rms=2.847%, delta=2.985%, char train=106.2%, word train=85.3%, skip ratio=0%, wrote checkpoint.
Finished! Error rate = 97.15
`
After performing OCR with the newly trained data, we get the following output text for a sample image:
`
नर
टन
नरण
टन
टन
न र
नट
`
Please suggest what is going wrong and how to improve the OCR accuracy.
data/unicharset
is rebuilt everytime I run make proto-model
(leading to errors since the underlying all-boxes file has been already deleted)
Hi,
What is the status of fine-tuning? Could you provide an example in the README?
I've downloaded eng.traineddata from https://github.com/tesseract-ocr/tessdata_best, put it into the data/eng/ directory as eng.traineddata, and set START_MODEL=eng in the Makefile.
But this gives me the following error during make training:
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
Makefile:131: recipe for target 'data/checkpoints/foo_checkpoint' failed
make: *** No rule to make target 'langdata'. Stop.
Ubuntu 18.04
Hello.
I am facing an issue with the error rate. I have added a box file that contains many lines, separated by \t,
but the error rate is very high and the LSTM model is not converging.
Here is a sample page.
~/ocrd-train$ make unicharset
python generate_line_box.py -i "data/train/devatest-0001-010001.tif" -t "data/train/devatest-0001-010001-gt.txt" > "data/train/devatest-0001-010001.box"
Traceback (most recent call last):
File "generate_line_box.py", line 39, in <module>
if not unicodedata.combining(line[-1]):
IndexError: string index out of range
Makefile:92: recipe for target 'data/train/devatest-0001-010001.box' failed
make: *** [data/train/devatest-0001-010001.box] Error 1
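The IndexError comes from `line[-1]` being evaluated on an empty string, i.e. a .gt.txt file that is empty or whitespace-only. A pre-check along the following lines (a sketch assuming the data/train layout from the log; the demo file names are hypothetical) lists the offending files before make ever runs:

```shell
# Demo setup: one good and one whitespace-only ground-truth file,
# stand-ins for real data/train contents.
mkdir -p data/train
printf 'some text\n' > data/train/good-gt.txt
printf '   \n'       > data/train/bad-gt.txt

# List ground-truth files that are empty or whitespace-only;
# generate_line_box.py crashes on these with the IndexError above.
find data/train -name '*gt.txt' | while read -r f; do
  grep -q '[^[:space:]]' "$f" || echo "empty ground truth: $f"
done
```

Removing (or filling in) the reported files should let make unicharset proceed.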
... to something with tesseract
I have extracted ocrd-testset.zip to ./data/ground-truth.
I typed the commands:
root@CUDA1:/home/ocrd-train# export PYTHONIOENCODING=utf8
root@CUDA1:/home/ocrd-train# make training
Output:
tesseract data/ground-truth/alexis_ruhe01_1852_0018_022.tif data/ground-truth/alexis_ruhe01_1852_0018_022 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Failed to read boxes from data/ground-truth/alexis_ruhe01_1852_0018_022.tif
.
.
.
tesseract data/ground-truth/wienbarg_feldzuege_1834_0318_006.tif data/ground-truth/wienbarg_feldzuege_1834_0318_006 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Failed to read boxes from data/ground-truth/wienbarg_feldzuege_1834_0318_006.tif
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
total=`cat data/all-lstmf | wc -l` \
no=`echo "$total * 0.90 / 1" | bc`; \
head -n "$no" data/all-lstmf > "data/list.train"
/bin/bash: bc: command not found
head: invalid number of lines: ''
Makefile:84: recipe for target 'data/list.train' failed
make: *** [data/list.train] Error 1
The list.train file is empty. Does anyone know how to fix this?
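The log explains it: `/bin/bash: bc: command not found`. The Makefile pipes `$total * 0.90 / 1` through bc, so `$no` ends up empty and head gets no line count, leaving list.train empty. Installing bc (e.g. `apt-get install bc`) is the direct fix; alternatively, the same 90/10 split can be computed with awk, which is part of every POSIX base system. A sketch against this repository's file names (the demo input stands in for the real all-lstmf produced by the find step):

```shell
# Demo input: in the real repo, data/all-lstmf is produced by the find step.
mkdir -p data
seq 1 10 | sed 's|^|line|' > data/all-lstmf

# Recompute the 90/10 train/eval split without bc (awk does the arithmetic).
total=$(wc -l < data/all-lstmf)
train=$(awk -v t="$total" 'BEGIN { printf "%d", t * 0.90 }')
head -n "$train" data/all-lstmf > data/list.train
tail -n "$((total - train))" data/all-lstmf > data/list.eval
```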
Hello guys
thanks for providing this tool. It eases one's life when using images instead of generating synthetic data for specific fonts.
I have been using it and thought I had it right! But when I checked the box files manually, I found that the box is the same for every character in the line segment, and that the final line segment is only 2*2 in width and height, which cannot be right!
Can you please guide me on what I have done wrong?
I have only cloned the repo; I already had tesseract and leptonica built and installed, and I ran make training MODEL_NAME=mine.
Is this the correct box format to expect, or am I missing something here?
P.S. The model is training fine and I am getting a final .traineddata, and everything looked fine until I checked this, because I was not very satisfied with the results.
Regards
This is not really an issue with ocrd-train; it concerns the following comment in the Makefile.
Normalization Mode - see src/training/language_specific.sh for details. Default: $(NORM_MODE)
Norm_modes are also specified in https://github.com/tesseract-ocr/tesseract/blob/master/src/training/unicharset_extractor.cpp#L103.
There is a conflict between the suggested norm_modes in language_specific.sh and unicharset_extractor.cpp.
I think that unicharset_extractor.cpp is correct and will create a PR in the tesseract repo to change language_specific.sh. See this comment for the proposed change.
Meanwhile, if you want, you can change the comment in the Makefile.
I ran into this problem after training a model with OCR-D.
In the terminal I entered:
tesseract 5.2.tif output --psm 7 -l xxx
and I get this message:
Failed to load any lstm-specific dictionaries for lang tes!!
Tesseract Open Source OCR Engine v4.0.0-beta.4-138-g2093 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Can anyone help?
I have a dataset (images in .tif format and transcriptions in .gt.txt format) and moved it to the data/train folder.
Running the training command :
make training MODEL_NAME=name-of-the-resulting-model
gives me the following error:
make: *** No rule to make target 'JOB#4686', needed by 'data/all-boxes'. Stop.
When I try with your sample dataset ocrd-testset, training runs without any error.
Running generate_line_box.py with my dataset yields the box values as expected.
Please suggest what can be done, or whether it is an issue with my dataset.
Attaching my sample dataset (since .tif is not supported on GitHub, I have attached a JPEG).
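One hedged guess at the cause, based on the target name in the message: make treats `#` in a prerequisite list as the start of a comment, so a ground-truth file whose name contains `#` truncates the generated target list. A quick scan for file names containing characters that are unsafe in make targets (the demo file name is hypothetical, modeled on the `JOB#4686` fragment in the error):

```shell
# Demo setup: a file name containing '#' next to a safe one.
mkdir -p data/train
touch 'data/train/JOB#4686-sample.tif' data/train/ok.tif

# Report file names containing characters make mishandles in target lists:
find data/train -name '*[#%:* ]*'
```

Renaming such files (e.g. replacing `#` with `_`) should let make build data/all-boxes.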
I am trying to train a new language with ben.traineddata. While providing sample training data with lprBD-7.gt.txt and a .tif image, I am getting the error:
Can't encode transcription: '| ঢাকা মেটো-গ |' in language '' Encoding of string failed! Failure bytes: ffffffe0 ffffffa6 ffffffbe ffffffe0 ffffffa6 ffffff95 ffffffe0 ffffffa6 ffffffbe 20 ffffffe0 ffffffa6 ffffffae ffffffe0 ffffffa7 ffffff87 ffffffe0 ffffffa6 ffffff9f ffffffe0 ffffffa7 ffffff8d ffffffe0 ffffffa6 ffffffb0 ffffffe0 ffffffa7 ffffff8b 2d 20 ffffffe0 ffffffa6 ffffff97 20 7c 20 ffffffe0 ffffffa5 ffffffa4 Can't encode transcription: '\ ঢাকা মেট্রো- গ | ।' in language '' 2 Percent improvement time=0, best error was 7.2 @ 271 At iteration 271/2700/153861, Mean rms=0.211%, delta=0%, char train=0%, word train=0%, skip ratio=5600%, New best char error = 0 wrote best model:data/checkpoints/BigBenww0_271.checkpoint wrote checkpoint.
What changes should be made in order to train a new language?
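The "Can't encode transcription ... in language ''" lines mean the unicharset compiled into the starter traineddata does not cover those Bengali codepoints, so those samples are skipped (hence the huge skip ratio). One commonly suggested remedy, assuming the START_MODEL support mentioned elsewhere in this tracker, is to fine-tune from the tessdata_best Bengali model so that its unicharset is merged with the one extracted from your data (the model name is taken from the checkpoint in your log; the TESSDATA path is a placeholder):

```shell
make training MODEL_NAME=BigBenww0 START_MODEL=ben TESSDATA=~/tessdata_best
```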
I am having an issue similar to https://github.com/OCR-D/ocrd-train/issues/47 except that I have already extracted ocrd-testset.zip to ./data/ground-truth.
I typed the commands:
root@CUDA1:/home/ocrd-train# export PYTHONIOENCODING=utf8
root@CUDA1:/home/ocrd-train# make training
Output:
tesseract data/ground-truth/alexis_ruhe01_1852_0018_022.tif data/ground-truth/alexis_ruhe01_1852_0018_022 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Failed to read boxes from data/ground-truth/alexis_ruhe01_1852_0018_022.tif
.
.
.
tesseract data/ground-truth/wienbarg_feldzuege_1834_0318_006.tif data/ground-truth/wienbarg_feldzuege_1834_0318_006 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Failed to read boxes from data/ground-truth/wienbarg_feldzuege_1834_0318_006.tif
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
total=`cat data/all-lstmf | wc -l` \
no=`echo "$total * 0.90 / 1" | bc`; \
head -n "$no" data/all-lstmf > "data/list.train"
total=`cat data/all-lstmf | wc -l` \
no=`echo "($total - $total * 0.90) / 1" | bc`; \
tail -n "$no" data/all-lstmf > "data/list.eval"
wget -Odata/radical-stroke.txt 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt'
--2019-02-27 23:40:21-- https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt
Resolving github.com (github.com)... 192.30.255.112, 192.30.255.113
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt [following]
--2019-02-27 23:40:21-- https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.196.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.196.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 330874 (323K) [text/plain]
Saving to: 'data/radical-stroke.txt'
data/radical-stroke.txt 100%[======================================================================>] 323.12K --.-KB/s in 0.03s
2019-02-27 23:40:22 (11.5 MB/s) - 'data/radical-stroke.txt' saved [330874/330874]
combine_lang_model \
--input_unicharset data/unicharset \
--script_dir data/ \
--output_dir data/ \
--lang foo
Loaded unicharset of size 15 from file data/unicharset
Setting unichar properties
Other case A of a is not in unicharset
Other case N of n is not in unicharset
Other case D of d is not in unicharset
Other case E of e is not in unicharset
Other case R of r is not in unicharset
Other case h of H is not in unicharset
Other case F of f is not in unicharset
Other case c of C is not in unicharset
Other case V of v is not in unicharset
Other case L of l is not in unicharset
Other case I of i is not in unicharset
Setting script properties
Failed to load script unicharset from:data//Latin.unicharset
Warning: properties incomplete for index 3 = a
Warning: properties incomplete for index 4 = n
Warning: properties incomplete for index 5 = d
Warning: properties incomplete for index 6 = e
Warning: properties incomplete for index 7 = r
Warning: properties incomplete for index 8 = H
Warning: properties incomplete for index 9 = f
Warning: properties incomplete for index 10 = C
Warning: properties incomplete for index 11 = v
Warning: properties incomplete for index 12 = l
Warning: properties incomplete for index 13 = i
Warning: properties incomplete for index 14 = .
Config file is optional, continuing...
Failed to read data from: data//foo/foo.config
Null char=2
mkdir -p data/checkpoints
lstmtraining \
--traineddata data/foo/foo.traineddata \
--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
--model_output data/checkpoints/foo \
--learning_rate 20e-4 \
--train_listfile data/list.train \
--eval_listfile data/list.eval \
--max_iterations 10000
Warning: given outputs 15 not equal to unicharset of 14.
Num outputs,weights in Series:
1,36,0,1:1, 0
Num outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys48:48, 12480
Lfx96:96, 55680
Lrx96:96, 74112
Lfx256:256, 361472
Fc14:14, 3598
Total weights = 507502
Built network:[1,36,0,1[C3,3Ft16]Mp3,3Lfys48Lfx96Lrx96Lfx256Fc14] from request [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c15]
Training parameters:
Debug interval = 0, weights = 0.1, learning rate = 0.002, momentum=0.5
null char=13
Loaded 1/1 pages (1-1) of document data/ground-truth/alexis_ruhe01_1852_0332_007.lstmf
Failed to load list of eval filenames from data/list.eval
Failed to load eval data from: data/list.eval
Makefile:144: recipe for target 'data/checkpoints/foo_checkpoint' failed
make: *** [data/checkpoints/foo_checkpoint] Error 1
Hello dear OCR-D, thank you for making life easier with this repository.
I'm trying to run the Makefile but I keep getting this error:
Running aclocal
Running /usr/bin/libtoolize
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'config'.
libtoolize: copying file 'config/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
Running autoheader
Running automake --add-missing --copy
configure.ac:314: installing 'config/compile'
configure.ac:23: installing 'config/missing'
src/api/Makefile.am: installing 'config/depcomp'
Running autoconf
Missing autoconf-archive. Check the build requirements.
Something went wrong, bailing out!
Makefile:162: recipe for target 'tesseract.built' failed
make: *** [tesseract.built] Error 1
Any ideas?
thank you!
Hello!
I'm trying to run the Makefile with the test set provided in the directory, without success. I created the .box files, but the script ends with this error when I run make training:
combine_tessdata -u /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/usr/share/tessdata /foo.traineddata /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/usr/share/tessdata /foo.
Failed to read /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/usr/share/tessdata
Makefile:97: recipe for target 'data/unicharset' failed
make: *** [data/unicharset] Error 1
I know it's a problem with my PATH, but I don't really understand it. That folder does contain the .traineddata files.
In the OCR group I got this reply:
You have some problems with your path configuration, check the error message:
Failed to read /home/tulip/Documents/Em/OCR/OCRtraining/ocrd-train/usr/share/tessdata
the path does not make sense. And also the command line:
combine_tessdata -u /home/tulip/Documents/Em/OCR/OCRtraining/ocrd-train/usr/share/tessdata /foo.traineddata /home/tulip/Documents/Em/OCR/OCRtraining/ocrd-train/usr/share/tessdata /foo.
you probably also have a "blank" after "/usr/share/tessdata".
Bye
Lorenzo
But I still don't understand why this happens. What do I have to modify in the Makefile to make it work?
Thank you!
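For what it's worth, the symptom can be reproduced with a plain shell variable: a trailing blank inside the expanded TESSDATA path splits the argument in two, exactly as in the log above (TESSDATA here is a stand-in value):

```shell
# Note the trailing blank inside the value.
TESSDATA='/usr/share/tessdata '
# The unquoted expansion splits on that blank, producing two arguments:
echo combine_tessdata -u ${TESSDATA}/foo.traineddata
```

So the fix is to remove any stray whitespace after the TESSDATA assignment in the Makefile (or to quote the expansion in the recipe).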
tesseract data/ground-truth/alexis_ruhe01_1852_0087_027.tif data/ground-truth/alexis_ruhe01_1852_0087_027 --psm 6 lstm.train
Error in pixReadMemTiff: function not present
Error in pixReadMem: tiff: no pix returned
Error in pixaGenerateFontFromString: pix not made
Error in bmfCreate: font pixa not made
Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Error in findTiffCompression: function not present
Error in pixReadFromMultipageTiff: function not present
Hi,
I ran into a problem when trying to run the make training example shown in the README. I have installed tesseract and leptonica, but when I execute the training command, I get the following message:
find data/ground-truth -name '*.box' -exec cat {} \; > "data/all-boxes"
unicharset_extractor --output_unicharset "data/unicharset" --norm_mode 1 "data/all-boxes"
Failed to read data from: data/all-boxes
Wrote unicharset file data/unicharset
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
total=`cat data/all-lstmf | wc -l` \
no=`echo "$total * 0.90 / 1" | bc`; \
head -n "$no" data/all-lstmf > "data/list.train"
total=`cat data/all-lstmf | wc -l` \
no=`echo "($total - $total * 0.90) / 1" | bc`; \
tail -n "$no" data/all-lstmf > "data/list.eval"
wget -Odata/radical-stroke.txt 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt'
--2018-12-26 17:13:03-- https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt
Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113
Connecting to github.com (github.com)|192.30.253.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt [following]
--2018-12-26 17:13:03-- https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.4.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.4.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 330874 (323K) [text/plain]
Saving to: ‘data/radical-stroke.txt’
data/radical-stroke.txt 100%[=================================================================>] 323,12K 835KB/s in 0,4s
2018-12-26 17:13:05 (835 KB/s) - ‘data/radical-stroke.txt’ saved [330874/330874]
combine_lang_model \
--input_unicharset data/unicharset \
--script_dir data/ \
--output_dir data/ \
--lang test_model
Loaded unicharset of size 3 from file data/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data//Latin.unicharset
Config file is optional, continuing...
Failed to read data from: data//test_model/test_model.config
Null char=2
mkdir -p data/checkpoints
lstmtraining \
--traineddata data/test_model/test_model.traineddata \
--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
--model_output data/checkpoints/test_model \
--learning_rate 20e-4 \
--train_listfile data/list.train \
--eval_listfile data/list.eval \
--max_iterations 10000
Failed to load list of training filenames from data/list.train
Makefile:144: recipe for target 'data/checkpoints/test_model_checkpoint' failed
make: *** [data/checkpoints/test_model_checkpoint] Error 1
Thanks.
Hi!
I am trying to create a model with the dataset given in the repo. I tried the command below:
make training MODEL_NAME=name-of-the-resulting-mode
The dependencies are already installed on my Mac with Homebrew. Unfortunately, when I try this command I get the following errors:
find data/ground-truth -name '*.box' -exec cat {} \; > "data/all-boxes"
unicharset_extractor --output_unicharset "data/unicharset" --norm_mode 1 "data/all-boxes"
Failed to read data from: data/all-boxes
Wrote unicharset file data/unicharset
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
total=`cat data/all-lstmf | wc -l` \
no=`echo "$total * 0.90 / 1" | bc`; \
head -n "$no" data/all-lstmf > "data/list.train"
head: illegal line count -- 0
make: *** [data/list.train] Error 1
Thanks
While training for a sequence of characters in Devanagari script, such as क्ष (क ् ष), it gives us an error.
How do we train such conjunct words/letters composed of more than one character?
error.txt
Hi!
I am trying to train a model with the test dataset given in the repo.
I am trying the following command after the installation of the dependencies:
make training MODEL_NAME=name-of-the-resulting-model
Unfortunately, when I try this command I get the following errors:
tesseract data/train/image.tif data/train/image --psm 6 lstm.train
Error in pixReadMemTiff: function not present
Error in pixReadMem: tiff: no pix returned
Error in pixaGenerateFontFromString: pix not made
Error in bmfCreate: font pixa not made
Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Error in findTiffCompression: function not present
Error in pixReadFromMultipageTiff: function not present
...
...
Failed to load list of training filenames from data/list.train
Makefile:129 : recipe for target « data/checkpoints/name-of-the-resulting-model_checkpoint » failed
make: *** [data/checkpoints/name-of-the-resulting-model_checkpoint] Error 1
Thanks!
Right now:
Traceback (most recent call last):
File "generate_line_box.py", line 41, in <module>
if not unicodedata.combining(line[-1]):
IndexError: string index out of range
Makefile:111: recipe for target 'data/ground-truth/example.box' failed
#15 implements only a part of the necessary changes for using CONTINUE_FROM, namely the unicharset merging. Setting the training parameters as proposed by @Shreeshrii is still missing!
TESSDATA_REPO = _fast
_fast models are integer models and can NOT be used as base for finetuning. Only models from tessdata_best can be used for this.
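For instance, a float base model can be fetched directly from tessdata_best and placed where the Makefile expects it (the URL pattern is assumed from that repository's layout; eng is used as an example):

```shell
mkdir -p data/eng
wget -O data/eng/eng.traineddata \
  'https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata'
```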
I have tried the command line on Windows; it does not work.
D:\MyGithub\ocrd-train-master> make leptonica tesseract langdata
wget 'http://www.leptonica.org/source/leptonica-1.76.0.tar.gz'
process_begin: CreateProcess(NULL, wget http://www.leptonica.org/source/leptonica-1.76.0.tar.gz, ...) failed.
make (e=2):
Makefile:141: recipe for target 'leptonica-1.76.0.tar.gz' failed
make: *** [leptonica-1.76.0.tar.gz] Error 2
python generate_line_box.py -i "data/train/alexis_ruhe01_1852_0035_019.tif" -t "data/train/alexis_ruhe01_1852_0035_019.gt.txt" > "data/train/alexis_ruhe01_1852_0035_019.box"
Traceback (most recent call last):
File "generate_line_box.py", line 26, in <module>
im = Image.open(file(args.image, "r"))
NameError: name 'file' is not defined
Makefile:91: recipe for target 'data/train/alexis_ruhe01_1852_0035_019.box' failed
make: *** [data/train/alexis_ruhe01_1852_0035_019.box] Error 1
I get an error when running make training. It seems to insert an extra "/" in the reference to the data directory in one of the steps. Running on Ubuntu 18.04 in a Docker container.
Loaded unicharset of size 79 from file data/unicharset
Setting unichar properties
Other case I of i is not in unicharset
Other case Ä of ä is not in unicharset
Other case Ö of ö is not in unicharset
Other case Uͤ of uͤ is not in unicharset
Other case Aͤ of aͤ is not in unicharset
Other case Oͤ of oͤ is not in unicharset
Other case Y of y is not in unicharset
Setting script properties
Failed to load script unicharset from:data//Latin.unicharset
Cf. #40
Having fewer than 10 lines results in an empty eval list.
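Concretely, the Makefile computes the eval count as `(total - total * 0.90) / 1` through bc, which truncates toward zero, so any total below 10 yields 0 eval lines. A small demonstration of the truncation (awk is used here only so the example runs without bc; bc at its default scale truncates the same way):

```shell
# For totals below 10, the truncated 10% share is 0, leaving list.eval empty,
# e.g. total=9 -> train=8 eval=0.
for total in 5 9 10 20; do
  awk -v t="$total" \
    'BEGIN { printf "total=%d train=%d eval=%d\n", t, int(t*0.90), int(t - t*0.90) }'
done
```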
Hi,
it has been clarified elsewhere that box files for RTL languages should be generated like those for LTR languages. The input data format for OCR-D is line images with corresponding text strings. The example data provided in the README is straightforward for LTR scripts. However, is there a difference for RTL languages? Should the text string in .gt.txt be reversed? We are trying to train for Urdu, but the final error rate is 90% or above for 444 line pairs (sample attached). We suspect that the direction is the cause. If that is indeed the case, should the text files be reversed at the character level?
tesseract 4.0.0
leptonica-1.76.0
libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE
Darwin Kernel Version 18.2.0 ; RELEASE_X86_64 x86_64
Tesseract 4.0 using the best ara.traineddata file is recalling about 85% of the data, which is pretty good. I'm attempting to train Tesseract using fine-tuning for impact. I used the GitHub project OCR-D Train to generate the .box and .lstmf files required for training, since my training data is composed of text line images. After generating the required .box and .lstmf files, I trained tesseract with a couple of lines to 400 iterations, but the generated transcription with the fine-tuned model looks a lot like
"ل.َ1ح*جُ ح( .َو!ة.اع5 ّة'عآة'ا ن'جة.!ع. ”.َئءؤئجآ| ن!.5ل". I exhausted all the possibilities by training to max_iterations 0 and a low target_error_rate, but the results were similar.
The transcription generated by the new model can be found below (Fine Tuned.txt):
Fine Tuned.txt
The transcription generated by the original Arabic model can be found below (Arabic Trained Model.txt):
Arabic Trained Model.txt
The fine tuned model can be found below (test1.traineddata):
test1.traineddata.zip
I attempted to train from scratch using 4000 text line images, but they weren't enough to make a difference, and training from scratch didn't seem logical given that your trained model already recalls more than 80% of my data.
A sample of my training data which includes the .box and .lstmf is attached below:
training data.zip
I am trying to run Urdu training data (using the Noori Nastaleeq font), but make training urd results in the following:
python generate_line_box.py -i "data/ground-truth/longJameel_Noori_NastaleeqRegular1610.tif" -t "data/ground-truth/longJameel_Noori_NastaleeqRegular1610.gt.txt" > "data/ground-truth/longJameel_Noori_NastaleeqRegular1610.box"
Traceback (most recent call last):
File "generate_line_box.py", line 41, in <module>
print(u"%s %d %d %d %d 0" % (prev_char, 0, 0, width, height))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0645' in position 0: ordinal not in range(128)
Makefile:111: recipe for target 'data/ground-truth/longJameel_Noori_NastaleeqRegular1610.box' failed
I have attached the problematic line image with its .gt.txt. The files were generated on Windows using GDI and .NET and imported to Linux. Putting urd.traineddata in place beforehand doesn't help either.
output.zip
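This traceback is the same ASCII-stdout problem reported elsewhere in this tracker: under Python 2, print falls back to the ascii codec when stdout is redirected, so any non-ASCII character (here U+0645, ARABIC LETTER MEEM) raises UnicodeEncodeError regardless of the language. One workaround that needs no edits to generate_line_box.py is forcing UTF-8 on Python's I/O layer before invoking make:

```shell
# Force UTF-8 on Python's stdout/stderr for everything make spawns:
export PYTHONIOENCODING=utf8
# Sanity check: printing a non-ASCII character into a pipe no longer raises
# UnicodeEncodeError (python3 shown; python2 behaves the same once this is set):
python3 -c 'print(u"\u0645")' | wc -c
```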
In the Makefile, here is the code for finetuning
ifdef START_MODEL
mkdir -p data/checkpoints
lstmtraining \
--traineddata $(PROTO_MODEL) \
--old_traineddata \
--continue_from data/$(START_MODEL)/$(START_MODEL).lstm \
--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
--model_output data/checkpoints/$(MODEL_NAME) \
--learning_rate 20e-4 \
--train_listfile data/list.train \
--eval_listfile data/list.eval \
--max_iterations 10000
Why do we need the following line? I thought it was only used in training from scratch.
--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
Should the learning rate be set lower for fine-tuning? The learning rate for training from scratch is 20e-4, so it would seem that the learning rate for fine-tuning should be significantly lower?
--learning_rate 20e-4 \
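For comparison, the upstream training docs describe fine-tuning without --net_spec at all (the network geometry comes from the model passed to --continue_from) and with a smaller learning rate. The sketch below is not this repository's Makefile; the paths and the 1e-4 rate are illustrative assumptions:

```shell
lstmtraining \
  --traineddata data/foo/foo.traineddata \
  --old_traineddata data/eng/eng.traineddata \
  --continue_from data/eng/eng.lstm \
  --model_output data/checkpoints/foo \
  --learning_rate 1e-4 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 3000
```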
It turned out that only the file radical-stroke.txt is needed for bootstrapping.
I modified the .py file because I kept getting a Unicode error.
I just inserted three more lines of code at the beginning of the file:
import io
import argparse
import unicodedata
from PIL import Image
# the three added lines (Python 2 only; setdefaultencoding was removed in Python 3):
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
cheers
Hi guys,
I’m pretty new to this, so please forgive me if I’m missing something obvious.
I’ve posted to the mailing list because Tesseract sometimes confuses the digit 4 with a 9 in the material I’m currently processing. Someone over there pointed me to your project.
So what I would like to do now is to fine-tune the Latin script to fix the recognition errors I’m seeing. If I understand this correctly, I’ll need to go to data/ground-truth and create files there for Tesseract to learn from, e.g.:
April_2014.gt.txt
April 2014
April_2014.tif
Also, I’ve cloned https://github.com/tesseract-ocr/tessdata.
What else do I need to do? The reason I’m asking is that I get an error when executing the following:
$ make -j4 training START_MODEL=Latin TESSDATA=/home/vagrant/tessdata/script
python generate_line_box.py -i "data/ground-truth/April_2012.tif" -t "data/ground-truth/April_2012.gt.txt" > "data/ground-truth/April_2012.box"
python generate_line_box.py -i "data/ground-truth/April_2013.tif" -t "data/ground-truth/April_2013.gt.txt" > "data/ground-truth/April_2013.box"
python generate_line_box.py -i "data/ground-truth/April_2014.tif" -t "data/ground-truth/April_2014.gt.txt" > "data/ground-truth/April_2014.box"
wget -Odata/radical-stroke.txt 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt'
--2018-12-11 00:00:10-- https://github.com/tesseract-ocr/langdata_lstm/raw/master/radical-stroke.txt
Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113
Connecting to github.com (github.com)|192.30.253.112|:443... tesseract data/ground-truth/April_2012.tif data/ground-truth/April_2012 --psm 6 lstm.train
tesseract data/ground-truth/April_2014.tif data/ground-truth/April_2014 --psm 6 lstm.train
find data/ground-truth -name '*.box' -exec cat {} \; > "data/all-boxes"
connected.
tesseract data/ground-truth/April_2013.tif data/ground-truth/April_2013 --psm 6 lstm.train
HTTP request sent, awaiting response... Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
mkdir -p data/Latin
combine_tessdata -u /home/vagrant/tessdata/script/Latin.traineddata data/Latin/Latin
302 Found
Location: https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt [following]
--2018-12-11 00:00:10-- https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
151.101.112.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 330874 (323K) [text/plain]
Saving to: ‘data/radical-stroke.txt’
data/radical-stroke.txt 0%[ ] 0 --.-KB/s Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Extracting tessdata components from /home/vagrant/tessdata/script/Latin.traineddata
Wrote data/Latin/Latin.lstm
Wrote data/Latin/Latin.lstm-punc-dawg
find data/ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
data/radical-stroke.txt 100%[==============================================================================================================================================>] 323.12K --.-KB/s in 0.1s
2018-12-11 00:00:10 (2.40 MB/s) - ‘data/radical-stroke.txt’ saved [330874/330874]
total=`cat data/all-lstmf | wc -l` \
no=`echo "$total * 0.90 / 1" | bc`; \
head -n "$no" data/all-lstmf > "data/list.train"
total=`cat data/all-lstmf | wc -l` \
no=`echo "($total - $total * 0.90) / 1" | bc`; \
tail -n "$no" data/all-lstmf > "data/list.eval"
Wrote data/Latin/Latin.lstm-word-dawg
Wrote data/Latin/Latin.lstm-number-dawg
Wrote data/Latin/Latin.lstm-unicharset
Wrote data/Latin/Latin.lstm-recoder
Wrote data/Latin/Latin.version
Version string:4.00.00alpha:Latin:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=1587099, offset=192
18:lstm-punc-dawg:size=5954, offset=1587291
19:lstm-word-dawg:size=88816882, offset=1593245
20:lstm-number-dawg:size=86050, offset=90410127
21:lstm-unicharset:size=18023, offset=90496177
22:lstm-recoder:size=2735, offset=90514200
23:version:size=82, offset=90516935
unicharset_extractor --output_unicharset "data/ground-truth/my.unicharset" --norm_mode 2 "data/all-boxes"
Extracting unicharset from box file data/all-boxes
Other case a of A is not in unicharset
Other case P of p is not in unicharset
Other case R of r is not in unicharset
Other case I of i is not in unicharset
Other case L of l is not in unicharset
Wrote unicharset file data/ground-truth/my.unicharset
merge_unicharsets data/Latin/Latin.lstm-unicharset data/ground-truth/my.unicharset "data/unicharset"
Loaded unicharset of size 303 from file data/Latin/Latin.lstm-unicharset
Loaded unicharset of size 13 from file data/ground-truth/my.unicharset
Wrote unicharset file data/unicharset.
combine_lang_model \
--input_unicharset data/unicharset \
--script_dir data/ \
--output_dir data/ \
--lang foo
Loaded unicharset of size 303 from file data/unicharset
Setting unichar properties
Other case Ẹ̀ of ẹ̀ is not in unicharset
Setting script properties
Failed to load script unicharset from:data//Latin.unicharset
Warning: properties incomplete for index 3 = K
#
# ... truncated ...
#
Warning: properties incomplete for index 302 = ẹ̀
Config file is optional, continuing...
Failed to read data from: data//foo/foo.config
Null char=2
mkdir -p data/checkpoints
lstmtraining \
--traineddata data/foo/foo.traineddata \
--old_traineddata /home/vagrant/tessdata/script/Latin.traineddata \
--continue_from data/Latin/Latin.lstm \
--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
--model_output data/checkpoints/foo \
--learning_rate 20e-4 \
--train_listfile data/list.train \
--eval_listfile data/list.eval \
--max_iterations 10000
Loaded file data/Latin/Latin.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 302 to 302!
Num (Extended) outputs,weights in Series:
1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys64:64, 20736
Lfx96:96, 61824
Lrx96:96, 74112
Lfx512:512, 1247232
Fc302:302, 0
Total weights = 1404064
Previous null char=301 mapped to 301
Continuing from data/Latin/Latin.lstm
Loaded 1/1 pages (1-1) of document data/ground-truth/April_2014.lstmf
Failed to load list of eval filenames from data/list.eval
Failed to load eval data from: data/list.eval
Makefile:131: recipe for target 'data/checkpoints/foo_checkpoint' failed
make: *** [data/checkpoints/foo_checkpoint] Segmentation fault (core dumped)
Looks like it wants a data/list.eval file which isn’t there. Is this why it’s crashing?
I’m running this on Ubuntu 16.04.
Thank you!
Log generated during training
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Loaded 1/1 pages (1-1) of document data/ground-truth/bori-g0351-14-026.exp0.lstmf
At iteration 1189/1200/1200, Mean rms=6.836%, delta=55.863%, char train=111.93%, word train=99.35%, skip ratio=0%, New worst char error = 111.93 wrote checkpoint.
Loaded 1/1 pages (1-1) of document data/ground-truth/bori-g0351-14-014.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1289/1300/1300, Mean rms=6.812%, delta=55.992%, char train=111.945%, word train=98.95%, skip ratio=0%, New worst char error = 111.945At iteration 1090, stage 0, Eval Char error rate=100, Word error rate=100 wrote checkpoint.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 1389/1400/1400, Mean rms=6.806%, delta=56.201%, char train=114.316%, word train=98.65%, skip ratio=0%, New worst char error = 114.316At iteration 1189, stage 0, Eval Char error rate=100, Word error rate=100 wrote checkpoint.
.
.
.
.
.
.
Finished! Error rate = 87.055
debug_interval -1
/bin/bash: debug_interval: command not found
make: [Makefile:148: data/checkpoints/mar10_checkpoint] Error 127 (ignored)
I'm training with a new set of two fonts. The goal is to use Tesseract to recognize individual characters (only capital letters and digits), not entire words, but the results are far from decent and it looks like I'm doing something wrong. Tesseract and Leptonica were installed by the provided scripts.
Inspired by the test set provided in this repo, I created these tif files with their corresponding .gt.txt files:
From original binarized chars:
From two TTFs to TIF images with random text:
At the end of the data creation process I have 1869 mixed text lines.
First I ran the Makefile with the default 10000 iterations, but the best error rate was still high.
I thought it was a matter of needing more iterations, so I increased it to 30000, but nothing improved. The following image shows the results:
Sometimes I can even see the char train error increasing instead of decreasing.
What am I missing here? Do I need more training data? Is my initial data missing some important property?
I'd appreciate any help!
Hi @kba
I have tried the following command:
combine_tessdata -u /mnt/e/projects/Training_Tesseract/ocrd-train/usr/share/tessdata/foo.traineddata /mnt/e/projects/Training_Tesseract/ocrd-train/usr/share/tessdata/foo.
Failed to read /mnt/e/projects/Training_Tesseract/ocrd-train/usr/share/tessdata/foo.traineddata
Makefile:98: recipe for target 'data/unicharset' failed
I haven't seen any file called foo.traineddata
in the tessdata
directory. What do I need to do here?
When I replace foo.traineddata
with eng.traineddata
, the command works fine, which is expected, since combine_tessdata -u
extracts an existing traineddata file.
Thanks in advance
E.g. do not use the file
constructor function, #9
The script works for line level images.
I have a number of scanned page images with ground truth files.
Does the OCR-D project have any tools to segment them into line images with corresponding ground-truth text?
I have seen a closed issue about this problem before. As suggested, I switched to Python 3, but the problem still persists.
Here is the output log
python generate_line_box.py -i "data/train/alexis_ruhe01_1852_0018_022.tif" -t "data/train/alexis_ruhe01_1852_0018_022.gt.txt" > "data/train/alexis_ruhe01_1852_0018_022.box"
Traceback (most recent call last):
File "generate_line_box.py", line 40, in <module>
print(u"%s %d %d %d %d 0" % (prev_char, 0, 0, width, height))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017f' in position 0: ordinal not in range(128)
Makefile:110: recipe for target 'data/train/alexis_ruhe01_1852_0018_022.box' failed
make: *** [data/train/alexis_ruhe01_1852_0018_022.box] Error 1
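The crash happens because stdout is redirected into the .box file and falls back to the ASCII codec, which cannot encode U+017F (LATIN SMALL LETTER LONG S). One possible fix, sketched below, is to force stdout to UTF-8 inside the script (the variable values are illustrative, not taken from generate_line_box.py); setting `PYTHONIOENCODING=utf-8` in the environment is an alternative:

```python
import sys

# Force UTF-8 output even when stdout is a redirected pipe/file whose
# encoding defaults to ASCII under a C locale (Python 3.7+).
if sys.stdout.encoding and sys.stdout.encoding.lower() not in ("utf-8", "utf8"):
    sys.stdout.reconfigure(encoding="utf-8")

# Illustrative values mimicking the failing print in generate_line_box.py;
# prev_char is the exact character from the traceback.
prev_char = "\u017f"
width, height = 1200, 48
line = "%s %d %d %d %d 0" % (prev_char, 0, 0, width, height)
print(line)
```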
I think this line, in the "data/list.eval" block:
tail -n "+$$no" $(ALL_LSTMF) > "$@"
should be:
tail -n "$$no" $(ALL_LSTMF) > "$@"
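The two forms behave quite differently, which a small stand-alone demo makes clear (files.txt is a stand-in here, not the real $(ALL_LSTMF) list): `tail -n "+N"` streams from line N to the end of the file, while plain `tail -n "N"` keeps only the last N lines.

```shell
# Five dummy lines playing the role of the .lstmf file list.
printf '%s\n' line1 line2 line3 line4 line5 > files.txt
no=2
tail -n "+$no" files.txt > from_line_no.txt   # line2..line5 (everything from line 2)
tail -n "$no"  files.txt > last_n_lines.txt   # line4 and line5 (last 2 lines only)
```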
Ground Truth within OCR-D is mainly represented in PAGE XML. Training based on input files in this format is highly desirable.
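As a rough illustration of what such support would involve, here is a hedged sketch that pulls line IDs, bounding boxes, and transcriptions out of a PAGE XML file with the standard library. It assumes the 2013-07-15 PAGE namespace and that each TextLine carries a Coords element with "x,y x,y ..." point pairs; real OCR-D files may use other namespace versions or polygon conventions:

```python
import xml.etree.ElementTree as ET

# Assumed namespace version; adjust for the PAGE schema your files declare.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def extract_lines(page_xml_text):
    """Return (line_id, (xmin, ymin, xmax, ymax), text) for each TextLine."""
    root = ET.fromstring(page_xml_text)
    lines = []
    for tl in root.iter("{%s}TextLine" % NS["pc"]):
        points = tl.find("pc:Coords", NS).get("points")
        # Parse "x,y x,y ..." pairs and reduce the polygon to a bounding box.
        xs, ys = zip(*(map(int, p.split(",")) for p in points.split()))
        bbox = (min(xs), min(ys), max(xs), max(ys))
        unicode_el = tl.find("pc:TextEquiv/pc:Unicode", NS)
        text = unicode_el.text if unicode_el is not None else ""
        lines.append((tl.get("id"), bbox, text))
    return lines
```

The bounding box could then be used to crop the line image from the page scan, with the text written to a matching .gt.txt file.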