textrecognitiondatagenerator's Introduction

TextRecognitionDataGenerator

A synthetic data generator for text recognition

What is it for?

Generating text image samples to train OCR software. Now supporting non-Latin text! For a more thorough tutorial, see the official documentation.

What do I need to make it work?

Install the PyPI package

pip install trdg

Afterwards, you can use trdg from the CLI. I recommend using a virtualenv instead of installing with sudo.

If you want to add another language, clone the repository instead and run pip install -r requirements.txt.

Docker image

If you would rather not have to install anything to use TextRecognitionDataGenerator, you can pull the docker image.

docker pull belval/trdg:latest

docker run -v /output/path/:/app/out/ -t belval/trdg:latest trdg [args]

The path (/output/path/) must be absolute.
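
If you launch the container from a script, a small helper can resolve the output folder to an absolute path before it is passed to docker. The sketch below is an illustration only: docker_command is a hypothetical name and not part of trdg. The image name and the /app/out/ mount point are taken from the commands above.

```python
import os
import shlex


def docker_command(output_dir, trdg_args):
    """Build the `docker run` command shown above with an absolute host path.

    `docker_command` is a hypothetical helper, not part of trdg.
    """
    host_path = os.path.abspath(output_dir)  # the volume path must be absolute
    return [
        "docker", "run",
        "-v", f"{host_path}:/app/out/",
        "-t", "belval/trdg:latest",
        "trdg", *trdg_args,
    ]


if __name__ == "__main__":
    cmd = docker_command("out", ["-c", "10", "-w", "5"])
    print(shlex.join(cmd))  # paste the printed line into a shell
```

The helper only builds the argument list; pass it to subprocess.run if you want to execute it.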

New

  • Add --stroke_width argument to set the width of the text stroke (Thank you @SunHaozhe)
  • Add --stroke_fill argument to set the color of the text contour if stroke > 0 (Thank you @SunHaozhe)
  • Add --word_split argument to split on word instead of per-character. This is useful for ligature-based languages
  • Add --dict argument to specify a custom dictionary (Thank you @luh0907)
  • Add --font_dir argument to specify the fonts to use
  • Add --output_mask to output character-level mask for each image
  • Add --character_spacing to control space between characters (in pixels)
  • Add python module
  • Add --font to use only one font for all the generated images (Thank you @JulienCoutault!)
  • Add --fit and --margins for finer layout control
  • Change the text orientation using the -or parameter
  • Specify a text color range using -tc '#000000,#FFFFFF' (the quotes are required)
  • Add support for Simplified and Traditional Chinese
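
To make the -tc color range above more concrete, here is a minimal sketch of how a value such as '#000000,#FFFFFF' could be turned into one random color. It assumes the range is sampled independently per channel between the two endpoints. That is an assumption for illustration, not a description of trdg's internals, and parse_hex and sample_color are hypothetical names.

```python
import random


def parse_hex(color):
    """Convert '#RRGGBB' to an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def sample_color(color_range, rng=random):
    """Pick one value per channel between the two endpoints of '#A,#B'."""
    low, high = (parse_hex(c) for c in color_range.split(","))
    return tuple(rng.randint(min(a, b), max(a, b)) for a, b in zip(low, high))


if __name__ == "__main__":
    print(sample_color("#000000,#FFFFFF", random.Random(0)))
```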

How does it work?

Words will be randomly chosen from a dictionary of a specific language. Then an image of those words will be generated by using font, background, and modifications (skewing, blurring, etc.) as specified.
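
The first step, picking words at random from a dictionary, can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: random_sentence is a hypothetical name and the word list is a stand-in for a real dictionary file.

```python
import random


def random_sentence(words, word_count, rng=random):
    """Join `word_count` words drawn at random (with replacement) from `words`."""
    return " ".join(rng.choice(words) for _ in range(word_count))


if __name__ == "__main__":
    dictionary = ["apple", "river", "quiet", "window", "seven"]  # stand-in word list
    print(random_sentence(dictionary, 5, random.Random(0)))
```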

Basic (Python module)

The usage as a Python module is very similar to the CLI, but it is more flexible if you want to include it directly in your training pipeline, and will consume less space and memory. There are 4 generators that can be used.

from trdg.generators import (
    GeneratorFromDict,
    GeneratorFromRandom,
    GeneratorFromStrings,
    GeneratorFromWikipedia,
)

# The generators use the same arguments as the CLI, only as parameters
generator = GeneratorFromStrings(
    ['Test1', 'Test2', 'Test3'],
    blur=2,
    random_blur=True
)

for img, lbl in generator:
    # Do something with the Pillow images here.
    print(lbl, img.size)
    break  # stop after one sample; remove to keep iterating

The full class definitions are in the trdg.generators package of the repository.
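
When plugging a generator into a training loop, it can help to cap the number of samples drawn from it. The sketch below uses itertools.islice for that. It runs against fake_generator, a stand-in written for this example, so it works without trdg installed. Whether a given trdg generator is unbounded by default is an assumption here, so check the class definition.

```python
from itertools import islice


def take(generator, n):
    """Return the first `n` (image, label) pairs from any iterable."""
    return list(islice(generator, n))


def fake_generator():
    """Stand-in for a trdg generator: yields (image, label) pairs forever."""
    index = 0
    while True:
        yield f"<image {index}>", f"Test{index % 3 + 1}"
        index += 1


if __name__ == "__main__":
    for img, lbl in take(fake_generator(), 4):
        print(lbl, img)
```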

Basic (CLI)

trdg -c 1000 -w 5 -f 64

You get 1,000 randomly generated images with random text on them.

By default, they will be generated to out/ in the current working directory.

Text skewing

What if you want random skewing? Add -k and -rk (trdg -c 1000 -w 5 -f 64 -k 5 -rk)
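
If you drive the CLI from a script, the argument list can be assembled programmatically. The sketch below uses only flags that appear in this README (-c, -w, -f, -k, -rk). trdg_command is a hypothetical helper name, and the parameter names are my own labels for the flag values.

```python
def trdg_command(count, words, fmt, skew_angle=0, random_skew=False):
    """Build a trdg argument list from flags shown in this README."""
    cmd = ["trdg", "-c", str(count), "-w", str(words), "-f", str(fmt)]
    if skew_angle:
        cmd += ["-k", str(skew_angle)]
    if random_skew:
        cmd.append("-rk")
    return cmd


if __name__ == "__main__":
    # Same command as the skewing example above.
    print(" ".join(trdg_command(1000, 5, 64, skew_angle=5, random_skew=True)))
```

Pass the returned list to subprocess.run(cmd, check=True) to execute it.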

Text distortion

You can also add distortion to the generated text with -d and -do.

Text blurring

But scanned documents usually aren't that clear, are they? Add -bl and -rbl to get Gaussian blur on the generated image with a user-defined radius (for example 0, 1, 2, or 4).

Background

Maybe you want another background? Add -b to select one of the four available backgrounds: Gaussian noise (0), plain white (1), quasicrystal (2), or image (3).

When using the image background (3), an image from the images/ folder is randomly selected and the text is written on it.
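
The -b values and the random image pick can be sketched as follows. The numeric values come from the list above. The Background enum and pick_background_image are hypothetical names for illustration, not trdg's own code.

```python
import os
import random
from enum import IntEnum


class Background(IntEnum):
    """Values accepted by -b, as listed above."""
    GAUSSIAN_NOISE = 0
    PLAIN_WHITE = 1
    QUASICRYSTAL = 2
    IMAGE = 3


def pick_background_image(folder, rng=random):
    """Return the path of one randomly chosen file in `folder`."""
    names = sorted(os.listdir(folder))  # sorted so a seeded rng is reproducible
    if not names:
        raise FileNotFoundError(f"no background images in {folder!r}")
    return os.path.join(folder, rng.choice(names))
```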

Handwritten

Or maybe you are working on an OCR for handwritten text? Add -hw! (Experimental)

It uses a TensorFlow model trained using this excellent project by Grzego.

The project does not require TensorFlow to run if you aren't using this feature.

Dictionary

The text is chosen at random in a dictionary file (that can be found in the dicts folder) and drawn on a white background made with Gaussian noise. The resulting image is saved as [text]_[index].jpg

There are a lot of parameters that you can tune to get the results you want, therefore I recommend checking out trdg -h for more information.
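
Because the label is encoded in the file name, it can be recovered when loading the dataset. The sketch below assumes the [text]_[index].jpg naming described above. parse_output_name is a hypothetical helper name.

```python
import os


def parse_output_name(path):
    """Split 'some text_12.jpg' into ('some text', 12)."""
    stem, _ext = os.path.splitext(os.path.basename(path))
    text, _, index = stem.rpartition("_")  # split on the LAST underscore only
    return text, int(index)


if __name__ == "__main__":
    print(parse_output_name("out/hello world_12.jpg"))
```

Splitting on the last underscore keeps labels that themselves contain underscores intact.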

Create images with Chinese text

It is simple! Just do trdg -l cn -c 1000 -w 5!

Generated texts come both in simplified and traditional Chinese scripts.

Create images with Japanese text

It is simple! Just do trdg -l ja -c 1000 -w 5!

Add new fonts

The script picks a font at random from the fonts directory.

Directory    Languages
fonts/latin  English, French, Spanish, German
fonts/cn     Chinese
fonts/ko     Korean
fonts/ja     Japanese
fonts/th     Thai

Simply add/remove fonts until you get the desired output.

If you want to add a new non-Latin language, the amount of work is minimal.

  1. Create a new folder in fonts named with your language's two-letter code
  2. Add a .ttf font to it
  3. Edit run.py to add an if statement in load_fonts()
  4. Add a text file in dicts with the same two-letter code
  5. Run the tool as you normally would, but add -l with your two-letter code

It only supports .ttf for now.
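
Step 3 can be pictured with the sketch below. It is a hypothetical stand-in, not the actual load_fonts() from run.py, which may be structured differently (the README mentions an if statement per language). Here the language code maps directly to a folder, with fonts/latin as an assumed fallback.

```python
import os


def load_fonts(lang, fonts_root="fonts"):
    """Return the .ttf files for `lang`, falling back to fonts/latin.

    Hypothetical sketch; the real load_fonts() in run.py may differ.
    """
    folder = os.path.join(fonts_root, lang)
    if not os.path.isdir(folder):
        folder = os.path.join(fonts_root, "latin")
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".ttf")  # ignore anything that is not a .ttf
    )
```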

Benchmarks

Number of images generated per second.

  • Intel Core i7-4710HQ @ 2.50GHz + SSD (-c 1000 -w 1)
    • -t 1 : 363 img/s
    • -t 2 : 694 img/s
    • -t 4 : 1300 img/s
    • -t 8 : 1500 img/s
  • AMD Ryzen 7 1700 @ 4.0GHz + SSD (-c 1000 -w 1)
    • -t 1 : 558 img/s
    • -t 2 : 1045 img/s
    • -t 4 : 2107 img/s
    • -t 8 : 3297 img/s
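
To read these numbers as scaling behavior, the short script below computes the speedup and parallel efficiency of each -t setting relative to -t 1. The throughput figures are copied from the list above. The scaling helper is just arithmetic written for this example.

```python
# Throughput (img/s) per thread count, copied from the list above.
BENCHMARKS = {
    "Intel Core i7-4710HQ": {1: 363, 2: 694, 4: 1300, 8: 1500},
    "AMD Ryzen 7 1700": {1: 558, 2: 1045, 4: 2107, 8: 3297},
}


def scaling(results):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    base = results[1]
    return {t: (rate / base, rate / base / t) for t, rate in results.items()}


if __name__ == "__main__":
    for cpu, results in BENCHMARKS.items():
        for threads, (speedup, eff) in scaling(results).items():
            print(f"{cpu} -t {threads}: {speedup:.2f}x, {eff:.0%} efficiency")
```

The i7-4710HQ has four physical cores, which is consistent with its throughput flattening between -t 4 and -t 8.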

Contributing

  1. Create an issue describing the feature you'll be working on
  2. Code said feature
  3. Create a pull request

Feature request & issues

If anything is missing, unclear, or simply not working, open an issue on the repository.

What is left to do?

  • Better background generation
  • Better handwritten text generation
  • More customization parameters (mostly regarding background)

textrecognitiondatagenerator's Issues

TextRecognitionDataGenerator as a Python module

The preferred usage has always been through the CLI. Unfortunately, this approach is not frictionless when used in a real machine learning pipeline that might include data augmentations.

The v1 candidate would give access to the data generator classes through an easy-to-use interface that can be used as seamlessly as the CLI.

A package would be uploaded to PyPI for ease of use.

ETA on numbers and symbols

Hi,

Great project, could you please let me know a rough time frame for a release with numbers and symbols included?

Best,
Vishal

Arabic text generator

Hi,

File names generated by the Arabic version of the repo are correct, as the letters of each word are connected. However, the text in the images has disconnected letters and the words run from left to right. The text in an image should run from right to left and the letters must be connected. Any suggestions on how to correct these issues?

Thanks

text-color is unable to apply

Found a bug in the use of text-color.
Adding background = background.convert('RGBA') in data_generator.py at line 89 fixes this problem.

I got this error, can anyone help me?

1. Error:
/data/20180809/TextRecognitionDataGenerator-master/TextRecognitionDataGenerator# python run.py -i "texts/subtitle.txt" -c 100 -w 5 -e png -b 3
Missing modules for handwritten text generation.
31%|#####################################2 | 31/100 [00:00<00:01, 68.48it/s]multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/data/20180809/TextRecognitionDataGenerator-master/TextRecognitionDataGenerator/data_generator.py", line 22, in generate_from_tuple
    cls.generate(*t)
  File "/data/20180809/TextRecognitionDataGenerator-master/TextRecognitionDataGenerator/data_generator.py", line 34, in generate
    image = ComputerTextGenerator.generate(text, font, text_color)
  File "/data/20180809/TextRecognitionDataGenerator-master/TextRecognitionDataGenerator/computer_text_generator.py", line 12, in generate
    image_font = ImageFont.truetype(font=font, size=32)
  File "/data/20180809/TextRecognitionDataGenerator-master/py3env/lib/python3.4/site-packages/PIL/ImageFont.py", line 261, in truetype
    return FreeTypeFont(font, size, index, encoding, layout_engine)
  File "/data/20180809/TextRecognitionDataGenerator-master/py3env/lib/python3.4/site-packages/PIL/ImageFont.py", line 144, in __init__
    self.font = core.getfont(font, size, index, encoding, layout_engine=layout_engine)
OSError: unknown file format
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run.py", line 290, in <module>
    main()
  File "run.py", line 278, in main
    ), total=args.count):
  File "/data/20180809/TextRecognitionDataGenerator-master/py3env/lib/python3.4/site-packages/tqdm/_tqdm.py", line 930, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.4/multiprocessing/pool.py", line 689, in next
    raise value
OSError: unknown file format

2. When I use my own background picture, I get a blurry image, but I want a clear one.
[image]
[image]
Why do they have different widths? (PS: I use my own texts.)
Looking forward to hearing from you.
@Belval

Width and height of image

Why is the new_width being calculated in this line if it's not used anywhere else? Is it supposed to be used when resizing the image?

new_width = float(new_text_width + 10) * (float(height) / float(new_text_height + 10))

But the variable you are using when resizing is new_text_width.

image_on_background = background.resize((int(new_text_width), height), Image.ANTIALIAS)
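
As a worked example of the formula quoted above: it scales the padded text width by the ratio of the target height to the padded text height, which preserves the aspect ratio. The function below restates that arithmetic. scaled_width and its parameter names are hypothetical, and the padding of 10 is taken from the quoted line.

```python
def scaled_width(text_width, text_height, target_height, padding=10):
    """Width that keeps the padded text's aspect ratio at `target_height`."""
    return float(text_width + padding) * (
        float(target_height) / float(text_height + padding)
    )


if __name__ == "__main__":
    # A 190x54 text block padded to 200x64, then scaled to a height of 32.
    print(scaled_width(190, 54, 32))
```

In this example the padded block is halved in height, so the aspect-preserving width is half of 200, while resizing to new_text_width directly would keep the unscaled width.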

I got this error, can anyone help me, please?

here is the error
python run.py -w 5 -f 64 -l am
0%| | 0/1000 [00:00<?, ?it/s]multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/test/Anaconda3/envs/py35/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/test/Documents/direse/scene/TextRecognitionDataGenerator/TextRecognitionDataGenerator/data_generator.py", line 22, in generate_from_tuple
    cls.generate(*t)
  File "/home/test/Documents/direse/scene/TextRecognitionDataGenerator/TextRecognitionDataGenerator/data_generator.py", line 36, in generate
    image = ComputerTextGenerator.generate(text, font, text_color, size, orientation, space_width)
  File "/home/test/Documents/direse/scene/TextRecognitionDataGenerator/TextRecognitionDataGenerator/computer_text_generator.py", line 9, in generate
    return cls.__generate_horizontal_text(text, font, text_color, font_size, space_width)
  File "/home/test/Documents/direse/scene/TextRecognitionDataGenerator/TextRecognitionDataGenerator/computer_text_generator.py", line 17, in __generate_horizontal_text
    image_font = ImageFont.truetype(font=font, size=font_size)
  File "/home/test/Anaconda3/envs/py35/lib/python3.5/site-packages/PIL/ImageFont.py", line 261, in truetype
    return FreeTypeFont(font, size, index, encoding, layout_engine)
  File "/home/test/Anaconda3/envs/py35/lib/python3.5/site-packages/PIL/ImageFont.py", line 144, in __init__
    self.font = core.getfont(font, size, index, encoding, layout_engine=layout_engine)
OSError: unknown file format
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run.py", line 342, in <module>
    main()
  File "run.py", line 330, in main
    ), total=args.count):
  File "/home/test/Anaconda3/envs/py35/lib/python3.5/site-packages/tqdm/_tqdm.py", line 930, in __iter__
    for obj in iterable:
  File "/home/test/Anaconda3/envs/py35/lib/python3.5/multiprocessing/pool.py", line 731, in next
    raise value
OSError: unknown file format
Exception ignored in: <bound method tqdm.__del__ of 0%| | 0/1000 [00:00<?, ?it/s]>
Traceback (most recent call last):
  File "/home/test/Anaconda3/envs/py35/lib/python3.5/site-packages/tqdm/_tqdm.py", line 882, in __del__
  File "/home/test/Anaconda3/envs/py35/lib/python3.5/site-packages/tqdm/_tqdm.py", line 1087, in close
  File "/home/test/Anaconda3/envs/py35/lib/python3.5/site-packages/tqdm/_tqdm.py", line 439, in _decr_instances
  File "/home/test/Anaconda3/envs/py35/lib/python3.5/_weakrefset.py", line 109, in remove
KeyError: <weakref at 0x7f686b4df598; to 'tqdm' at 0x7f686b53cd30>

Save generated text to file separately

How can I save the generated text to a file alongside the corresponding image? For example, the text 'foo' was generated, put on some background, and saved as foo.jpg. Can I also save the text 'foo' to a file, say foo.txt, containing only the text 'foo'?

How to keep words in order

For example, I want to create a rule that some words must appear before some other words. Some words occur at intervals. So I guess that an RNN may perform better on such data.

labels

Hello, and thanks first of all. Is there a labels.txt in the code for the generated images?

run.py crashes (OSError: unknown file format) when generating dataset for new non-Latin language

I added support for Hebrew with some .ttf fonts and a dictionary. The adjusted run.py and data_generator.py files run and work until they crash. When I put run.py in a for loop, it sometimes works flawlessly (and generates images) and sometimes crashes. Any thoughts?

(text_rec_task) galmoore@Gals-MacBook-Pro:~/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator$ python run_script.py
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 56.40it/s]
Missing modules for handwritten text generation.
args count10
100%|█████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 102.17it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 67.58it/s]
Missing modules for handwritten text generation.
args count10
100%|█████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 108.27it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 70.82it/s]
Missing modules for handwritten text generation.
args count10
0%| | 0/10 [00:00<?, ?it/s]multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/data_generator.py", line 23, in generate_from_tuple
    cls.generate(*t)
  File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/data_generator.py", line 42, in generate
    image = computer_text_generator.generate(text, font, text_color, size, orientation, space_width, fit)
  File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/computer_text_generator.py", line 7, in generate
    return _generate_horizontal_text(text, font, text_color, font_size, space_width, fit)
  File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/computer_text_generator.py", line 14, in _generate_horizontal_text
    image_font = ImageFont.truetype(font=font, size=font_size)
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/PIL/ImageFont.py", line 280, in truetype
    return FreeTypeFont(font, size, index, encoding, layout_engine)
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/PIL/ImageFont.py", line 145, in __init__
    layout_engine=layout_engine)
OSError: unknown file format
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run.py", line 376, in <module>
    main()
  File "run.py", line 364, in main
    ), total=args.count):
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1005, in __iter__
    for obj in iterable:
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
OSError: unknown file format
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 70.12it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 44.38it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 37.21it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 53.60it/s]
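
One possible explanation for a crash that appears only on some runs, not confirmed in this thread, is that the random font pick occasionally lands on a file in the fonts folder that is not a font at all (a hidden system file, for instance). A quick way to test that hypothesis is to check each file's first four bytes against the signatures that TrueType and OpenType files start with. looks_like_font is a hypothetical helper written for this check.

```python
SFNT_MAGIC = {
    b"\x00\x01\x00\x00",  # TrueType outlines
    b"OTTO",              # OpenType with CFF outlines
    b"true",              # legacy Apple TrueType
    b"ttcf",              # TrueType collection
}


def looks_like_font(path):
    """Return True if the file starts with a known sfnt signature."""
    with open(path, "rb") as handle:
        return handle.read(4) in SFNT_MAGIC
```

Running it over every file in the language's font folder and printing the paths that return False would show whether a stray file is present.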

Tighter cropping

Right now the images have quite a bit of unnecessary padding around the text, which reduces the usability of the generated dataset for specific tasks.

Instead, make text as big as possible and add a padding argument.

run.py arguments in README.md

It would be very useful if all the command line arguments were mentioned in the readme file.
Obviously, it could be viewed in the run.py file. Still, a person who is cloning the repo for the first time may not know all the options available.

Thanks!

Support Arabic and Urdu text

Enhancement, but it would be interesting to add support for Arabic and Hindi scripts.

I think adding a new font folder and a new dict for both languages would work.

Create comprehensive test suite

As the number of features grows, I can barely check for regression bugs. Therefore, a test suite should be made, with continuous integration such as TravisCI.

Text length

Can I generate texts with a fixed length size?

OSError: broken file

Traceback (most recent call last):
  File "run.py", line 340, in <module>
    main()
  File "run.py", line 328, in main
    ), total=args.count):
  File "/data/env/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py", line 931, in __iter__
    for obj in iterable:
  File "/data/env/anaconda3/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
OSError: broken file
Exception ignored in: <bound method tqdm.__del__ of 0%| | 21/500000 [00:00<6:24:25, 21.68it/s]>
Traceback (most recent call last):
  File "/data/env/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py", line 883, in __del__
  File "/data/env/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1088, in close
  File "/data/env/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py", line 439, in _decr_instances
  File "/data/env/anaconda3/lib/python3.6/_weakrefset.py", line 109, in remove
KeyError: <weakref at 0x7f0af5879688; to 'tqdm' at 0x7f0bc8e1d0f0>

Find better German and Spanish dictionaries

The dicts provided with the project for German and Spanish are not UTF-8 encoded. Unfortunately, that means encoding errors may arise.

I will therefore try to replace the current dicts.

Hollow letters when using handwritten

This is probably a regression, as it does not show in the generated examples in the README.md.

[Image: nonliturgic_0]

We can see when using a higher resolution that the letters are "hollow", in that there seem to be two lines per stroke instead of one bold line.

The color is also off: it is greyish while it should be black. If possible, the --text_color parameter should be supported as well.

background generator may cause error

if picture.size[0] < width:
    picture = picture.resize(
        [width, int(picture.size[1] * (width / picture.size[0]))],
        # what if resized height is still smaller than needed height?
        Image.ANTIALIAS,
    )
elif picture.size[1] < height:
    picture.thumbnail([int(picture.size[0] * (height / picture.size[1])), height], Image.ANTIALIAS)
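
One way to address the question in the comment is to scale by whichever dimension needs the larger factor, so that both the width and the height end up large enough. The sketch below shows only the size arithmetic. cover_size is a hypothetical helper, not code from the repository.

```python
import math


def cover_size(pic_w, pic_h, width, height):
    """Smallest size with the same aspect ratio that is at least width x height."""
    scale = max(width / pic_w, height / pic_h, 1.0)  # 1.0 means never shrink
    return math.ceil(pic_w * scale), math.ceil(pic_h * scale)


if __name__ == "__main__":
    # A 100x50 picture that must cover 200x200: the height drives the scale.
    print(cover_size(100, 50, 200, 200))
```

The resulting tuple could be passed to picture.resize before cropping to the exact target size.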

Chinese font problem

Some Chinese fonts cannot generate good samples (for example, some words could not be generated). Do you have any suggestions to solve the problem? Thank you in advance.

Add argument font in run.py

Hi,
I used this repo to generate text with only one personal font.

I would like to add a font argument to run.py: if a font is passed as a parameter, it will be the only one used to generate pictures.

I will work on it. I saw in the README that you want an issue before a PR, so I opened this one 😊

Hello, I changed the font and got an error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/liufengnan/Desktop/TextRecognitionDataGenerator/TextRecognitionDataGenerator/run.py", line 392, in <module>
    main()
  File "/Users/liufengnan/Desktop/TextRecognitionDataGenerator/TextRecognitionDataGenerator/run.py", line 309, in main
    ), total=args.count):
  File "/usr/local/lib/python3.6/site-packages/tqdm/_tqdm.py", line 979, in __iter__
    for obj in iterable:
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
OSError: broken file

Need to change how to name the file

Something embarrassing happened when I was generating the image T.T

FileNotFoundError: [Errno 2] No such file or directory: 'out/튃 귍 쓌 / 혦 찵 컒 뵶 명 똔 톨 눔_124.jpg'

This symbol ‘/’ should be '//'
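
A common workaround for this kind of failure is to replace path separators in the text before using it as a file name. The sketch below is one hedged possibility, not the fix adopted by the project. safe_name is a hypothetical helper that follows the [text]_[index] naming pattern seen in the error message.

```python
import os


def safe_name(text, index, extension="jpg"):
    """Build '[text]_[index].ext' with path separators replaced by '_'."""
    cleaned = text
    for sep in {"/", "\\", os.sep}:
        cleaned = cleaned.replace(sep, "_")
    return f"{cleaned}_{index}.{extension}"


if __name__ == "__main__":
    print(safe_name("a / b", 124))
```

Note that the replacement is lossy: the original label can no longer be recovered from the file name alone, so the labels would need to be stored separately.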

Generating Images similar to Oxford Synthetic Word Dataset

Hi,
I am trying to generate images containing single words similar to those in the Oxford Synthetic Word Dataset. The words will also contain symbols such as colons, percent signs, etc.
The process used to create the Oxford dataset is described in the image below.
[Image: process]

I am unsure how to generate such words along with symbols; the images I get (below) are very different from the ones in the Oxford dataset.
[Images: newsynth1, newsynth4]

Images from the Oxford dataset are given below:
[Images: synth1, synth2]

Fix dependencies versioning

OpenCV bumped their version from 3.2 to 3.4. Following this change, they removed the 3.2 version from PyPI. This means that, right now, installing the dependencies with pip install -r requirements.txt fails for anyone who clones the repo.

How to generate vertical images

For example, we often read words and characters from left to right,
but in Chinese, characters are sometimes arranged from top to bottom.
So I wonder: can this code generate a top-to-bottom arrangement of Chinese sentences?
