- Download the data and unzip it in the root directory.
- Create the directory tesstrain/data/DIN-ground-truth.
- Download the base model (*.traineddata) from tessdata_best.
- Create the directory tesstrain/data/ara.
- Extract the base model into that directory with: combine_tessdata -u ara.traineddata dx
- Copy the traineddata file to tesstrain/tessdata.
- Install the fonts used in split_training_text.py on the local machine.
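
The directory and copy steps above can be sketched in Python. This is only a minimal sketch: it assumes the base model has already been downloaded, and the function name prepare_layout and its parameters are hypothetical, not part of tesstrain.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_layout(root: Path, base_model: Path, start_model: str = "ara") -> dict:
    """Create the training directories and copy the base model into tessdata.

    prepare_layout is a hypothetical helper; only the directory names
    come from the steps listed above.
    """
    tesstrain = root / "tesstrain"
    ground_truth = tesstrain / "data" / "DIN-ground-truth"
    model_dir = tesstrain / "data" / start_model
    tessdata = tesstrain / "tessdata"

    # mkdir with exist_ok=True makes the helper safe to run repeatedly.
    for directory in (ground_truth, model_dir, tessdata):
        directory.mkdir(parents=True, exist_ok=True)

    # Copy the downloaded *.traineddata file next to the other models.
    copied = Path(shutil.copy2(base_model, tessdata / base_model.name))

    return {
        "ground_truth": ground_truth,
        "model_dir": model_dir,
        "tessdata": tessdata,
        "base_model": copied,
    }


if __name__ == "__main__":
    # Demonstration in a throwaway directory with a placeholder file,
    # so nothing in a real checkout is touched.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        placeholder = root / "ara.traineddata"
        placeholder.write_bytes(b"")  # stands in for the real base model
        for name, path in prepare_layout(root, placeholder).items():
            print(name, path.relative_to(root))
```

For a real run, pass the root directory of the checkout and the path of the downloaded ara.traineddata file instead of the placeholder.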
python3 parse_xml.py
python3 split_training_text.py
make training TESSDATA=tessdata GROUND_TRUTH_DIR=data/DIN-ground-truth MODEL_NAME=dx MAX_ITERATIONS=2500 START_MODEL=ara FINETUNE_TYPE=LAYER LANG_TYPE=RTL
sudo cp data/dx.traineddata /usr/share/tesseract-ocr/<tesseract_version>/tessdata/
tesseract image_file output -l dx  # tesseract appends .txt, so this writes output.txt
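
To keep the training parameters in one place, the make invocation above can be assembled from a dictionary. This is a hedged sketch: the variable values are copied from the command above, while the helper names make_training_command and run_training, the dry_run flag, and the assumption that make is run from the tesstrain directory are illustrative only.

```python
import shlex
import subprocess

# Values copied from the make invocation above.
TRAINING_VARS = {
    "TESSDATA": "tessdata",
    "GROUND_TRUTH_DIR": "data/DIN-ground-truth",
    "MODEL_NAME": "dx",
    "MAX_ITERATIONS": "2500",
    "START_MODEL": "ara",
    "FINETUNE_TYPE": "LAYER",
    "LANG_TYPE": "RTL",
}


def make_training_command(variables):
    """Return the argument list for `make training` (hypothetical helper)."""
    return ["make", "training"] + [f"{key}={value}" for key, value in variables.items()]


def run_training(variables, tesstrain_dir="tesstrain", dry_run=True):
    """Print the command, and run it only when dry_run is False.

    Assumption: make is executed from the tesstrain checkout, because
    GROUND_TRUTH_DIR and TESSDATA are given as relative paths.
    """
    command = make_training_command(variables)
    print(shlex.join(command))
    if not dry_run:
        subprocess.run(command, cwd=tesstrain_dir, check=True)
    return command


if __name__ == "__main__":
    # Dry run by default: shows the command without starting training.
    run_training(TRAINING_VARS)
```

Changing MAX_ITERATIONS or MODEL_NAME then means editing one dictionary entry rather than a long command line.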