Comments (3)
I worked around this with the bash script below to speed things up:
cd tesstrain
# Generate .box files for every .tif that does not have one yet
find data/*ground-truth/ -type f -name '*.tif' | while IFS= read -r line ; do base="${line%.tif}" ; [ ! -f "${base}.box" ] && echo "PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i \"${line}\" -t \"${base}.gt.txt\" > \"${base}.box\"" ; done | parallel -j"$(nproc)"
# Generate .lstmf files for every .tif that has a .box but no .lstmf yet
find data/*ground-truth/ -type f -name '*.tif' | while IFS= read -r line ; do base="${line%.tif}" ; [ ! -f "${base}" ] && [ -f "${base}.box" ] && [ ! -f "${base}.lstmf" ] && echo "tesseract \"${line}\" \"${base}\" --psm 13 lstm.train" ; done | parallel -j"$(nproc)"
from tesstrain.
I always used make -j for parallel builds of box and lstmf files, and it worked fine (with PNG images instead of TIFF, but that should not matter). Meanwhile, I have an even better alternative that no longer requires box and lstmf files.
Hi, this is not necessarily the answer you were looking for, but Tesstrain is essentially a wrapper that runs a sequence of Tesseract binaries with (hopefully) the correct parameters. Here is how I significantly speed up the development process:
- Use GPT/Claude to decompose the Tesstrain makefile into a series of components:
  - A master Makefile
  - A config.mk to store all parameters, which can be included by the various components
  - unicharset.mk for make unicharset
  - lists.mk for make lists
  - training.mk for make training
  - and perhaps a misc.mk for the rest
- Understand what each component does. Ask GPT/Claude to explain it to you if needed.
- Translate the core task of each .mk component to Python, which GPT/Claude is much better at. Have, say, unicharset.mk call python unicharset.py to execute the same tasks.
- Identify in each .py which tasks are parallelizable, and ask GPT/Claude to modify the code to leverage multithreading or multiprocessing.
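To make the last step concrete, here is a rough sketch of what a parallelized .lstmf step could look like in Python. It is not code from Tesstrain: the function names (pending_lstmf_jobs, build_command, run_all) are made up for illustration, and the only thing taken from this thread is the tesseract invocation used in the shell loop of the first comment. It assumes .tif images with .box files next to them. Threads are enough here because the heavy work happens in the tesseract subprocesses.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pending_lstmf_jobs(data_dir):
    """Return .tif images that have a .box file but no .lstmf file yet."""
    jobs = []
    for image in sorted(Path(data_dir).rglob("*.tif")):
        # with_suffix replaces only the final ".tif" extension.
        has_box = image.with_suffix(".box").exists()
        has_lstmf = image.with_suffix(".lstmf").exists()
        if has_box and not has_lstmf:
            jobs.append(image)
    return jobs


def build_command(image):
    """Build the tesseract call used in the first comment's shell loop."""
    output_base = image.with_suffix("")  # tesseract appends ".lstmf" itself
    return ["tesseract", str(image), str(output_base), "--psm", "13", "lstm.train"]


def run_all(data_dir, workers=None, runner=subprocess.run):
    """Run all pending jobs concurrently; return (image, exit code) pairs.

    `runner` is a hypothetical hook so the function can be exercised
    without tesseract installed; by default it is subprocess.run.
    """
    jobs = pending_lstmf_jobs(data_dir)
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        results = list(
            pool.map(lambda img: runner(build_command(img), check=False), jobs)
        )
    return [(img, res.returncode) for img, res in zip(jobs, results)]
```

Calling run_all on a ground-truth directory would then return one exit code per image, so failed lines can be listed and retried instead of aborting the whole run.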