clovaai / deep-text-recognition-benchmark
Text recognition (optical character recognition) with deep learning methods, ICCV 2019
License: Apache License 2.0
model input parameters 32 100 20 1 512 256 96 25 TPS ResNet BiLSTM Attn
loading pretrained model from /home1/zy/STR/psenet_benchmark/benchmark/pretrained_model/TPS-ResNet-BiLSTM-Attn-case-sensitive.pth
Traceback (most recent call last):
File "/home1/zy/STR/psenet_benchmark/api.py", line 22, in <module>
br(image_folder)
File "/home1/zy/STR/psenet_benchmark/api.py", line 15, in br
getrecognition(image_folder)
File "/home1/zy/STR/psenet_benchmark/benchmark/getrecognition.py", line 121, in getrecognition
demo(opt)
File "/home1/zy/STR/psenet_benchmark/benchmark/getrecognition.py", line 64, in demo
preds = model(image, text_for_pred, is_train=False)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "./benchmark/model.py", line 82, in forward
contextual_feature = self.SequenceModeling(visual_feature)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "./benchmark/modules/sequence_modeling.py", line 16, in forward
self.rnn.flatten_parameters()
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: set_storage is not allowed on Tensor created from .data or .detach()
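A workaround that reportedly avoids this error (an assumption on my part, not an official fix) is to drop the flatten_parameters() call in benchmark/modules/sequence_modeling.py; the call only compacts the LSTM weights in GPU memory, so removing it costs at most a performance warning. A minimal sketch of the patched module:

```python
import torch.nn as nn

class BidirectionalLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(BidirectionalLSTM, self).__init__()
        self.rnn = nn.LSTM(input_size, hidden_size,
                           bidirectional=True, batch_first=True)
        self.linear = nn.Linear(hidden_size * 2, output_size)

    def forward(self, input):
        # self.rnn.flatten_parameters()  # removed: raises "set_storage is not
        # allowed ..." under DataParallel on this PyTorch version (assumption)
        recurrent, _ = self.rnn(input)   # (batch, T, 2*hidden_size)
        output = self.linear(recurrent)  # (batch, T, output_size)
        return output
```

Alternatively, upgrading PyTorch beyond 1.1 is said to resolve the underlying DataParallel issue.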
Hi, thank you for sharing such great work! I am a novice and found something I cannot understand:
The AttnLabelConverter encodes the labels into text, length. The text has shape [batch_size, max_length+2], where the '2' accounts for the [GO] and [stop] tokens.
When the text is passed into Attention.forward(), however, the docstring says: text: the text-index of each image. [batch_size x (max_length+1)]. +1 for [GO] token. text[:, 0] = [GO].
Are these two text tensors the same object? If so, why are the shapes different?
Thanks for your help!
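They appear to be the same tensor at different points in the pipeline: the converter emits [GO] + characters + [s] (max_length+2 columns), while the decoder input drops the last column. A toy illustration (numpy, not repo code; the text[:, :-1] / text[:, 1:] split is an assumption based on the repo's training loop):

```python
import numpy as np

batch_size, batch_max_length = 2, 5
# AttnLabelConverter output: [GO] + up to batch_max_length chars + [s]
text = np.zeros((batch_size, batch_max_length + 2), dtype=np.int64)

decoder_input = text[:, :-1]  # (batch, max_length+1): starts with [GO]
target = text[:, 1:]          # (batch, max_length+1): ends with [s]
print(decoder_input.shape, target.shape)
```

So Attention.forward() sees max_length+1 columns even though the converter produced max_length+2.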
I added a simple test image to the demo folder, and it is recognized incorrectly.
The image:
The output:
128456782012
The command I used is the same as in the README.
Thinking that the problem could have been related to the string length, I tried retraining the TPS-ResNet-BiLSTM-Attn model using an imgW of 200 pixels, but the problem seems to be very similar.
Any idea on why this happens? It seems to me that this image is much simpler compared to the other demo images.
I added some characters and fine-tuned the pretrained model on my dataset with the parameters TPS, ResNet, BiLSTM, Attn, sensitive, adam. The loss converged quickly, but the accuracy always stays at zero.
However, the accuracy grows when I set the optimizer to Adadelta.
Could you give me some advice?
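One plausible explanation (an assumption, not a confirmed diagnosis): the training script's default lr=1 is tuned for Adadelta, and Adam usually diverges at that rate, which would match a quickly "converging" loss alongside zero accuracy. A hedged sketch of passing a smaller learning rate when switching to Adam:

```python
import torch

# A dummy parameter list stands in for model.parameters() here.
params = [torch.nn.Parameter(torch.zeros(4))]

# Assumption: with --adam, something like lr=0.001 is a more typical starting
# point than the Adadelta-oriented default of lr=1.
optimizer = torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999))
print(optimizer.defaults['lr'])
```

On the command line this would correspond to adding something like `--adam --lr 0.001`.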
Hi folks,
You did a great job of comparing and contrasting the effects of different modules on text recognition! I'm currently trying to train a fast model, so I'm training ResNet + CTC and VGG + CTC on my own. Using the default settings from your training script, I wonder how long the training takes to reach ~70% accuracy as shown in Table 8?
By the way, have you tried using MobileNet-V2 as the backbone? With the provided settings, I can't get MobileNet-V2 + CTC past ~40% accuracy.
After training my model, I'm having trouble testing it on images. I'm getting this error message:
RuntimeError: Error(s) in loading state_dict for DataParallel:
Missing key(s) in state_dict: "module.Prediction.attention_cell.i2h.weight", "module.Prediction.attention_cell.h2h.weight", "module.Prediction.attention_cell.h2h.bias", "module.Prediction.attention_cell.score.weight", "module.Prediction.attention_cell.rnn.weight_ih", "module.Prediction.attention_cell.rnn.weight_hh", "module.Prediction.attention_cell.rnn.bias_ih", "module.Prediction.attention_cell.rnn.bias_hh", "module.Prediction.generator.weight", "module.Prediction.generator.bias".
Unexpected key(s) in state_dict: "module.Prediction.weight", "module.Prediction.bias".
Any advice on this would be helpful.
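The missing keys are the attention decoder's parameters, while the unexpected module.Prediction.weight/bias pair looks like a single Linear layer, i.e. a CTC head; this suggests the checkpoint was trained with --Prediction CTC but loaded into a model built with --Prediction Attn (this reading is an assumption). A small helper to inspect a checkpoint before choosing the test-time flags:

```python
def head_type(state_dict):
    """Guess which --Prediction head a checkpoint was trained with.

    Assumption: a CTC head is a single Linear layer (weight/bias directly
    under module.Prediction), while an Attn head stores a generator plus an
    attention cell under module.Prediction.
    """
    if 'module.Prediction.weight' in state_dict:
        return 'CTC'
    if 'module.Prediction.generator.weight' in state_dict:
        return 'Attn'
    return 'unknown'

# Usage (hypothetical path):
#   head_type(torch.load('saved_model.pth', map_location='cpu'))
```

The test-time --Prediction flag (and the other module flags) must match the ones used for training.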
Once this model is trained, when feeding an image through demo.py I can only get a single word as output at best. How can I change the code, or what am I doing wrong, if I want to use this to recognize images with many words on them?
First of all, thank you so much for such wonderful work! I tried to test some vertical text, but it cannot detect any character. Does it need training specifically for vertical text, or does the model have some limitation?
Hi Author,
Great work from you, and thanks for the sharing.
I noted that the accuracy of your best model on IC13 is 93.6% in the paper, while it is 95.98% on the Robust Reading Competition website.
Could you please explain this difference?
Thanks.
Hello, I cannot open the dataset link. Could you provide create_dataset.py so that I can create my own training dataset?
Hi, can you provide the other models, for example the one with CTC instead of Attn?
Should the training images' width be fixed?
My dataset consists of images containing 14 numeric characters (for example). I configured my options as below:
character: 0123456789-
sensitive: False
PAD: True
data_filtering_off: False
Transformation: TPS
FeatureExtraction: ResNet
SequenceModeling: BiLSTM
Prediction: Attn
num_fiducial: 20
input_channel: 3
output_channel: 512
hidden_size: 256
num_gpu: 1
num_class: 13
but after training for 200,000 steps, the accuracy is still 0%.
Do I have to add a blank character to the alphabet for CTC to work properly? Or can I use the same alphabet as for the attention prediction?
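As far as I can tell from utils.py, you do not add the blank yourself: CTCLabelConverter reserves index 0 for it internally, so the same --character alphabet can serve both CTC and Attn. A hedged sketch of that indexing scheme (names are illustrative, not repo code):

```python
def build_ctc_dict(character):
    """Sketch of CTC label indexing: index 0 is reserved for the blank,
    so the user-supplied alphabet starts at index 1 (assumption based on
    the repo's CTCLabelConverter)."""
    dict_character = list(character)
    char_to_index = {}
    for i, char in enumerate(dict_character):
        char_to_index[char] = i + 1          # 0 is the CTC blank
    index_to_char = ['[CTCblank]'] + dict_character
    return char_to_index, index_to_char

char_to_index, index_to_char = build_ctc_dict('0123456789-')
print(index_to_char[0], char_to_index['0'])
```

So an 11-character alphabet yields 12 CTC classes, without a blank symbol in --character itself.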
Will you release the best models?
Total_batch_size: 96 = 96
Can you please tell me how to train on my own database? Thank you.
Hey, I just found that you've done a really fantastic job.
Seems like it works on English, Korean & Japanese STR.
So... if I retrain the model with Chinese STR datasets (like RCTW, etc.), can it still work?
I don't know if you've tried such work. It'll be great if I could get more suggestions.
RuntimeError: Error(s) in loading state_dict for DataParallel:
size mismatch for module.Prediction.attention_cell.rnn.weight_ih: copying a param with shape torch.Size([1024, 352]) from checkpoint, the shape in current model is torch.Size([1024, 294]).
size mismatch for module.Prediction.generator.bias: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([38]).
size mismatch for module.Prediction.generator.weight: copying a param with shape torch.Size([96, 256]) from checkpoint, the shape in current model is torch.Size([38, 256]).
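The shapes suggest a character-set mismatch rather than a corrupt file: 96 and 38 output classes correspond to the case-sensitive and default alphabets respectively, each plus the two Attn tokens [GO] and [s] (my reading; worth checking against the training flags). The arithmetic, using the charsets train.py appears to use:

```python
import string

# Default alphabet: digits + lowercase letters.
default_charset = '0123456789abcdefghijklmnopqrstuvwxyz'

# With --sensitive, train.py uses string.printable[:-6], i.e. the 94
# printable ASCII characters without the whitespace tail (assumption).
sensitive_charset = string.printable[:-6]

# Attn adds two tokens ([GO], [s]) on top of the alphabet.
print(len(default_charset) + 2, len(sensitive_charset) + 2)
```

So a checkpoint with 96-class weights needs the same --sensitive flag (and alphabet) at load time.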
I want to check the results using Korean images.
If it supports Korean, how can I get the results in Korean?
I don't know if you can provide code to process the .mat file from http://www.robots.ox.ac.uk/~vgg/data/scenetext/.
There is no documentation for it, and I don't know how to handle this file.
It is so weird: the 'wordBB' of 8/ballet_106_0.jpg has 15 points. As we know, 8 points are enough to describe a box, so why does it provide 15 points?
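For what it's worth, gt.mat can be read with scipy.io.loadmat, and wordBB stores all word boxes of an image in a single (2, 4, n_words) array (x/y rows, four corners, one slice per word), which can look like "extra points" when printed flat. A hedged helper, assuming that layout:

```python
import numpy as np

def word_boxes(gt, i):
    """Return an (n_words, 4, 2) array of word quadrilaterals for image i.

    Assumes the SynthText layout: gt['wordBB'][0][i] has shape
    (2, 4, n_words); an image with a single word may come back squeezed
    to (2, 4), so we restore the third axis.
    """
    bb = gt['wordBB'][0][i]
    if bb.ndim == 2:                  # single word: (2, 4) -> (2, 4, 1)
        bb = bb[:, :, np.newaxis]
    return bb.transpose(2, 1, 0)      # (n_words, 4, 2): per-word corner lists

# Usage (hypothetical):
#   from scipy.io import loadmat
#   gt = loadmat('gt.mat')   # keys include 'imnames', 'wordBB', 'charBB', 'txt'
#   boxes = word_boxes(gt, 0)
```

So a "15-point" wordBB most likely just means the image contains many words, not that one box has 15 corners.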
Hi, I found that the CRNN network implemented in this project differs from the original one, which has a BN layer following each conv2d layer. I wonder why you removed them. Thanks.
First, thanks for your great work :) ! You've done a good job!
Here's my question, I've retrained the model with the option as:
"--select_data MJ-ST --batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC"
, corresponding to the original version of CRNN. The remaining parameters are set to their defaults, and the model is trained on the MJ and ST datasets.
However, when testing with my local retrained best_accuracy model, the result accuracy is shown as below:
in IC13_857: only 88.45% while 91.1% in paper.
in IC13_1015: 87.68% while 89.2% in paper.
in IC15_1811: 66.37% while 69.4% in paper.
in IC15_2077: 64.07% while 64.2% in paper.
It seems there is still something inappropriate in my retraining process. Should I reset the learning rate or extend the training iterations? Do you have any ideas for improving the performance to align with the published results in the paper?
I've also tried training only on the MJ dataset, and that model seems to have higher accuracy on IC13_857. When I extend training to both MJ and ST, is it necessary to increase the iteration number to get better accuracy?
Looking forward to your reply ^_^
Hi,
In train.py and test.py there is a manual loop setting requires_grad=True/False. Is it really necessary? Why not use a with torch.no_grad() context for the whole validation part? On the other hand, such a context is used in test.validation(), but only for the input tensors, where it is unnecessary, because tensors have requires_grad=False by default.
Sorry, I don't want to be too pedantic, in the end everything works fine, but just for the clarity :)
Regards,
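For reference, the suggestion above could look like this minimal sketch (a generic validation loop, not the repo's test.validation()):

```python
import torch

def validate(model, loader, criterion, device='cpu'):
    """Run a validation pass with autograd disabled for everything inside,
    instead of toggling requires_grad flags by hand."""
    model.eval()
    total_loss = 0.0
    with torch.no_grad():   # no graph is built for any op in this block
        for images, targets in loader:
            preds = model(images.to(device))
            total_loss += criterion(preds, targets.to(device)).item()
    model.train()
    return total_loss / max(len(loader), 1)
```

Wrapping the whole loop this way also saves memory, since no intermediate activations are kept for backprop.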
I found that when using Attn, the result is always better than using CTC, except for None+ResNet+BiLSTM tested on IC03. What is a possible interpretation of this result?
Hi, bravo for the great findings!
Do you have any inference script?
Thanks.
dataset_root: ./result dataset: /
sub-directory: /. num samples: 0
num total samples of /: 0 x 1.0 (total_data_usage_ratio) = 0
num samples of / per batch: 192 x 1.0 (batch_ratio) = 192
Traceback (most recent call last):
File "train.py", line 279, in <module>
train(opt)
File "train.py", line 25, in train
train_dataset = Batch_Balanced_Dataset(opt)
File "/home/user/桌面/deep-text-recognition-benchmark-master/dataset.py", line 59, in __init__
collate_fn=_AlignCollate, pin_memory=True)
File "/home/user/anaconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 176, in __init__
sampler = RandomSampler(dataset)
File "/home/user/anaconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 66, in __init__
"value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
Does resizing to 32×100 matter if the text image has a large width-to-height ratio? For example, one image is 32×80 and another is 32×300. Does this also go through _AlignCollate? Looking forward to your reply.
Hi, how can I solve a ZeroDivisionError? Is the error caused by a small dataset?
Also, when creating an lmdb dataset, how can I automatically generate the gt.txt file?
CUDA_VISIBLE_DEVICES=0 python train.py \
--train_data result_training_cp/ --valid_data result_validation_cp/ --select_data / --batch_ratio 1 \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn
[0/300000] Loss: 3.74243 elapsed_time: 15.06362
Traceback (most recent call last):
File "train.py", line 278, in <module>
train(opt)
File "train.py", line 160, in train
model, criterion, valid_loader, converter, opt)
File "/home/cloudera1/deep/yc_reco/yc_reco/test.py", line 129, in validation
norm_ED += edit_distance(pred, gt) / len(gt)
ZeroDivisionError: division by zero
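A defensive guard (my sketch, not the repo's fix) would handle empty ground-truth labels, which make len(gt) zero, for instance when an empty label slips through data filtering:

```python
def norm_edit_distance(pred, gt, edit_distance):
    """Normalized edit distance that tolerates an empty ground-truth string.

    The fallback values for an empty gt are an assumption: any non-empty
    prediction counts as fully wrong (1.0), an empty one as fully right (0.0).
    The edit_distance function is passed in (e.g. nltk's edit_distance).
    """
    if len(gt) == 0:
        return 1.0 if len(pred) > 0 else 0.0
    return edit_distance(pred, gt) / len(gt)
```

It is also worth checking the dataset for empty labels, since they usually indicate a labeling or filtering problem rather than a metric problem.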
Hi, I was testing with number images, but they are misrecognized as letters. Can you provide a model that recognizes numbers too?
Thanks a lot
Hi,
Thank you for this repo and great survey of current state-of-art methods.
I wonder what your approach was for annotations like ICDAR's, which come as quadrangles (usually a rotated bbox). Did you de-rotate them first, or just use a larger axis-aligned bbox?
On the other hand, modern detection methods are able to predict rotation or even masks. Is the spatial transformation in the recognition step then unnecessary?
Regards,
Hi,
I have been trying to run the code on these two images:
I get correct results on the demo website https://demo.ocr.clova.ai/,
This is the result I get through the offline code:
First Image: he11505973
Second Image: hijo86
I am training your model on custom data and I am using transfer learning for the feature extraction part with Resnet-152.
To train the model I use synthetic data, since what I will recognize has a limited range, and I also use real data. I would like to know what a good ratio of synthetic to real data is, and when I can consider that my model has seen enough synthetic data to train on the real data.
Best
I created a Korean lmdb dataset with 9000 samples, but I have a problem.
I already set opt.character in train.py:
parser.add_argument('--character', type=str, default='가각간갇갈감갑값갓강갖같갚갛개객
error occured 7
error occured 225
error occured 523
error occured 760
error occured 846
Written 1000 / 9000
...
error occured 8210
error occured 8289
error occured 8345
error occured 8497
error occured 8807
error occured 8899
error occured 8973
Created dataset with 8937 samples
Can somebody help me?
When using the demo script with a larger number of images, e.g. 2000 images, it raises a GPU OOM error.
Can anyone help solve this problem?
Thanks.
When I use ST_spe and the validation set from https://drive.google.com/drive/folders/192UfE9agQUMNq6AgU3_E05_FcPZK4hyt
CUDA_VISIBLE_DEVICES=2 python3 train.py --train_data data/ST_spe --valid_data data/validation --select_data / --batch_ratio 1 --Transformation TPS --sensitive --workers 0 --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn
The training seems to be stuck
[0/300000] Loss: 4.57599 elapsed_time: 3.90644
TTkkkkkkk//////////////, gt: SANDS , False
T7777777777777777777777, gt: PULL , False
TTTTeeeeTeTeTeTeTeTeTeT, gt: DIRECTION , False
TTTTTTTTTTTTTTTTTTTTTTT, gt: ALTERNATE , False
TT77`7`7`777777,777,,77, gt: FOR , False
[0/300000] valid loss: 4.65960 accuracy: 0.000, norm_ED: 36773.78
best_accuracy: 0.000, best_norm_ED: 36773.78
log:
------------ Options -------------
experiment_name: TPS-ResNet-BiLSTM-Attn-Seed1111
train_data: data/ST_spe
valid_data: data/validation
manualSeed: 1111
workers: 0
batch_size: 64
num_iter: 300000
valInterval: 2000
continue_model:
adam: False
lr: 1
beta1: 0.9
rho: 0.95
eps: 1e-08
grad_clip: 5
select_data: ['/']
batch_ratio: ['1']
total_data_usage_ratio: 1.0
batch_max_length: 23
imgH: 32
imgW: 280
rgb: False
character: 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
sensitive: True
PAD: False
data_filtering_off: False
Transformation: TPS
FeatureExtraction: ResNet
SequenceModeling: BiLSTM
Prediction: Attn
num_fiducial: 20
input_channel: 1
output_channel: 512
hidden_size: 256
num_gpu: 1
num_class: 96
---------------------------------------
Thank you for your reply.
Hi,
Thanks a lot for this great effort.
However, using the provided pretrained model, the results differ greatly from the outputs inferred on the demo site (https://demo.ocr.clova.ai).
The online demo recognizes the text, but local inference using the provided TPS-ResNet-BiLSTM-Attn.pth fails badly.
Can you please elaborate on any extra steps taken for the demo-site model, or on any other pre-processing/post-processing steps involved?
Does the model support variable-length recognition? I trained the model on Chinese STR datasets but get bad accuracy.
I use TPS-ResNet and VGG-BiLSTM-Attn.
What should I do to get normal accuracy?
Hi,
Great work from you, and thanks for the sharing.
I trained the model in --sensitive mode with MJSynth + ST_spe. Here are the training accuracy and loss:
[79000/300000] Loss: 0.09761 elapsed_time: 103154.51228
FOR, , gt: FOR , False
Mo-vediplon , gt: MOVCD1PLO1 , False
REBATE , gt: REBATE , True
WALK , gt: WALK , True
NISSAN-SBI , gt: NISSANSBI , False
[79000/300000] valid loss: 2.17037 accuracy: 58.939, norm_ED: 2006.77
best_accuracy: 60.812, best_norm_ED: 1922.32
**Model gives constant accuracy from this iteration:**
[47500/300000] valid loss: 2.29950 accuracy: 56.822, norm_ED: 2124.85
best_accuracy: 60.812, best_norm_ED: 1922.32
[48000/300000] Loss: 0.11939 elapsed_time: 62696.96234
proudly , gt: proudly , True
LOVE , gt: LOVE , True
SAXONS , gt: SAXONS , True
TESCO , gt: TESCO , True
Should I stop training?
What should I do to get good accuracy?
How can we make this code compatible with CUDA 10?
Hi,
how can I train using multiple GPUs in this project? I tried setting CUDA_VISIBLE_DEVICES=0,1, but it fails.
best
I have a question about the attention code: the hidden state is always initialized to zeros, but I think it should be an encoded hidden state of batch_H from the LSTM. Also, I can't see what function the rnn_cell serves in your code.
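For context, here is how I read the attention step (a sketch mirroring the repo's AttentionCell, with my own comments; treat the details as assumptions). The zero initial hidden state is a standard choice because the decoder gathers information from batch_H through the attention weights at every step, and the rnn_cell is what carries the decoder's own state from one character to the next:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCellSketch(nn.Module):
    def __init__(self, input_size, hidden_size, num_embeddings):
        super().__init__()
        self.i2h = nn.Linear(input_size, hidden_size, bias=False)
        self.h2h = nn.Linear(hidden_size, hidden_size)
        self.score = nn.Linear(hidden_size, 1, bias=False)
        self.rnn = nn.LSTMCell(input_size + num_embeddings, hidden_size)

    def forward(self, prev_hidden, batch_H, char_onehots):
        # attention scores over encoder steps, conditioned on the decoder's
        # previous hidden state (zeros at the first step)
        e = self.score(torch.tanh(self.i2h(batch_H)
                                  + self.h2h(prev_hidden[0]).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)             # (B, T, 1)
        context = (alpha * batch_H).sum(dim=1)  # weighted sum of batch_H
        # the LSTMCell advances the decoder state character by character
        hidden = self.rnn(torch.cat([context, char_onehots], 1), prev_hidden)
        return hidden, alpha
```

So batch_H is consulted through alpha at every step rather than injected once as the initial state.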
Are the best models case-insensitive? Can you provide case-sensitive best models?
Can you please tell me how to convert the model to ONNX? Thank you.
How can I train and test these models on my own data?
I'm getting this error when trying to train on default data, any information is helpful!