
deep-text-recognition-benchmark's People

Contributors

akarazniewicz, boom1492, clovaaiadmin, coallaoh, edwardpwtsoi, gwkrsrch, ku21fan, sangkwun, soonge, tgalkovskyi, tjdevworks, varshaneya, yacobby


deep-text-recognition-benchmark's Issues

set_storage is not allowed on Tensor created from .data or .detach()

model input parameters 32 100 20 1 512 256 96 25 TPS ResNet BiLSTM Attn
loading pretrained model from /home1/zy/STR/psenet_benchmark/benchmark/pretrained_model/TPS-ResNet-BiLSTM-Attn-case-sensitive.pth
Traceback (most recent call last):
File "/home1/zy/STR/psenet_benchmark/api.py", line 22, in
br(image_folder)
File "/home1/zy/STR/psenet_benchmark/api.py", line 15, in br
getrecognition(image_folder)
File "/home1/zy/STR/psenet_benchmark/benchmark/getrecognition.py", line 121, in getrecognition
demo(opt)
File "/home1/zy/STR/psenet_benchmark/benchmark/getrecognition.py", line 64, in demo
preds = model(image, text_for_pred, is_train=False)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "./benchmark/model.py", line 82, in forward
contextual_feature = self.SequenceModeling(visual_feature)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "./benchmark/modules/sequence_modeling.py", line 16, in forward
self.rnn.flatten_parameters()
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: set_storage is not allowed on Tensor created from .data or .detach()
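A workaround often used for this error (an assumption based on the symptoms, not an official fix from the repo) is to skip `flatten_parameters()` when it fails: under `DataParallel`, the replicas hold parameter views created via `.data`, and calling `flatten_parameters()` on them raises exactly this `set_storage` error. A minimal sketch of the repo-style `BidirectionalLSTM` with that guard:

```python
import torch
import torch.nn as nn

class BidirectionalLSTM(nn.Module):
    """Sketch of a guarded BidirectionalLSTM (hypothetical workaround):
    flatten_parameters() is a performance hint, so skipping it on
    DataParallel replicas only costs speed, not correctness."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden_size,
                           bidirectional=True, batch_first=True)
        self.linear = nn.Linear(hidden_size * 2, output_size)

    def forward(self, x):
        try:
            self.rnn.flatten_parameters()  # may raise on replicated params
        except RuntimeError:
            pass  # safe to skip: only affects cuDNN memory layout
        out, _ = self.rnn(x)
        return self.linear(out)

m = BidirectionalLSTM(512, 256, 256)
y = m(torch.randn(2, 26, 512))  # (batch, seq, features)
```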

Testing the CTC model gives a runtime error

After training with the CTC prediction head and then running demo.py for testing, I get the following error.
IMG_20190725_201840

RuntimeError: storage has wrong size : expected -5510384958902273621 got 294912

The text shape when using Attention

Hi, thank you for sharing such great work! I am a novice and found something I cannot understand:

The AttnLabelConverter encodes the labels into text, length. The text has shape [batch_size, max_length+2], where the '2' accounts for the [GO] and [STOP] tokens.

When the text is passed into Attention.forward(), however, the docstring says text : the text-index of each image. [batch_size x (max_length+1)]. +1 for [GO] token. text[:, 0] = [GO].

Are these two text tensors the same object? If so, why are the shapes different?
Thanks for your help!
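The two shapes can be reconciled if the training script drops one column before the forward pass, so the converter emits max_length+2 columns while Attention receives max_length+1. A minimal illustration of that convention (an assumption about the repo's training loop, not code copied from it):

```python
import torch

# Sketch: encode() produces [GO] + chars + [STOP], padded to
# batch_max_length + 2 columns; before the forward pass the last
# column is stripped, so Attention sees batch x (max_length + 1).
batch_max_length = 5
text = torch.zeros(2, batch_max_length + 2, dtype=torch.long)  # from encode()
input_text = text[:, :-1]   # fed to the model: starts with [GO]
target = text[:, 1:]        # used by the loss: ends with [STOP]
```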

Simple image causing troubles

I added a simple test image to the demo folder and it gets wrongly recognized.
The image:
test_dbg
The output:
128456782012

The command I used is the same of the README.
Thinking that the problem could have been related to the string length, I tried retraining the TPS-ResNet-BiLSTM-Attn model using an imgW of 200 pixels, but the problem seems to be very similar.

Any idea on why this happens? It seems to me that this image is much simpler compared to the other demo images.

The loss converges but the accuracy stays at zero with the Adam optimizer

I added some characters and fine-tuned the pretrained model on my dataset with the parameters TPS, ResNet, BiLSTM, Attn, sensitive, adam. The loss converged quickly, but the accuracy always stays at zero.
However, the accuracy grows when I use the Adadelta optimizer.
Could you give me some advice?
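One plausible cause (an assumption based on the symptoms) is that the training script's default learning rate of 1 is tuned for Adadelta and is far too large for Adam, which can make the reported loss drift down while predictions stay useless. A sketch of picking a smaller rate when switching optimizers (the 1e-3 value is a common default, not a repo recommendation):

```python
import torch

params = [torch.nn.Parameter(torch.zeros(4))]  # stand-in model parameters

# lr=1 is an Adadelta-style rate; for Adam something around 1e-3 is
# the usual starting point (assumption, tune for your data).
optimizer = torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999))
```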

Recognition errors using TPS-ResNet / VGG-BiLSTM-Attn

The sample images
d_autohomecar__wKgHPltYXyWAZHTbAANpvWtj5Hs964_0 和兰豪华感,而风上的 wrong
d_autohomecar__wKgHPltYXyWAZHTbAANpvWtj5Hs964_1 内饰设局觉觉温馨范儿 wrong
d_autohomecar__wKgHPlstHCiANf7JAAGvh7L-4DU249_1 后扭力梁非独立悬架 correct
d_autohomecar__wKgHPlt2pX6AQNrVAANoQlFZZXQ045_0 变之水波落务变得更出 wrong

I trained the model using 32×256 inputs, then set batch_max_length=64 (for both test and train). Something seems wrong: when a sample contains many characters, the result is wrong.

The training dataset is normal.

Thanks

Time it takes to train ResNet + CTC and VGG + CTC

Hi folks,

You did a great job of comparing and contrasting the effects of different modules on text recognition! I'm currently trying to train a fast model, so I'm training ResNet + CTC and VGG + CTC on my own. Using the default settings from your training script, how long does training take to reach ~70% accuracy as shown in Table 8?

By the way, have you tried using MobileNet-V2 as the backbone? With the provided settings, I can't get MobileNet-V2 + CTC past ~40% accuracy.

Unexpected key(s) in state_dict

After training my model, I'm having trouble testing it on images. I'm getting this error message:

RuntimeError: Error(s) in loading state_dict for DataParallel:
Missing key(s) in state_dict: "module.Prediction.attention_cell.i2h.weight", "module.Prediction.attention_cell.h2h.weight", "module.Prediction.attention_cell.h2h.bias", "module.Prediction.attention_cell.score.weight", "module.Prediction.attention_cell.rnn.weight_ih", "module.Prediction.attention_cell.rnn.weight_hh", "module.Prediction.attention_cell.rnn.bias_ih", "module.Prediction.attention_cell.rnn.bias_hh", "module.Prediction.generator.weight", "module.Prediction.generator.bias".
Unexpected key(s) in state_dict: "module.Prediction.weight", "module.Prediction.bias".

any advice on this would be helpful
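The missing `attention_cell.*` keys together with the unexpected bare `module.Prediction.weight`/`bias` suggest a checkpoint trained with a CTC head (a single Linear) being loaded into a model built with `--Prediction Attn`; the options used at test time must match those used for training. A small hypothetical helper to check which head a checkpoint was saved with:

```python
# Illustrative helper (not from the repo): inspect the state_dict keys
# to tell which prediction head produced a checkpoint. A CTC head saves
# "module.Prediction.weight"/"module.Prediction.bias"; an Attn head
# saves "module.Prediction.attention_cell.*" and generator keys.
def prediction_type(state_dict):
    if any(k.startswith("module.Prediction.attention_cell") for k in state_dict):
        return "Attn"
    return "CTC"

ctc_like = {"module.Prediction.weight": None, "module.Prediction.bias": None}
print(prediction_type(ctc_like))  # CTC
```

In practice you would call `prediction_type(torch.load("saved.pth"))` and rebuild the model with the matching `--Prediction` option.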

using this model for a scene with multiple lines of text

Once the model is trained, when feeding an image through demo.py I can only get a single word as output at best. What should I change in this code (or what am I doing wrong) if I want to recognize images containing multiple words?

vertical text recognition

First of all, thank you so much for such wonderful work. I tried to test some vertical text, but the model cannot detect any characters. Does it need training specifically for vertical text, or does the model have some limitation?

create_dataset

Hello, I cannot open the dataset link. Could you provide create_dataset.py so that I can create my own training dataset?

best accuracy: 0.000 after training 200,000 steps

My dataset consists of images containing 14 numeric characters each (for example). I configured my options as below:

character: 0123456789-
sensitive: False
PAD: True
data_filtering_off: False
Transformation: TPS
FeatureExtraction: ResNet
SequenceModeling: BiLSTM
Prediction: Attn
num_fiducial: 20
input_channel: 3
output_channel: 512
hidden_size: 256
num_gpu: 1
num_class: 13

but after training 200,000 steps, the accuracy is still 0%.
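As a sanity check on the configuration above, the class count follows from the character set (assuming the usual converter conventions in this codebase): the Attn converter adds two tokens, while CTC adds one blank.

```python
# num_class sanity check (assumed conventions: Attn = chars + [GO] + [s],
# CTC = chars + blank). character "0123456789-" has 11 symbols, so
# num_class = 13 in the log above is consistent for Attn.
character = "0123456789-"
num_class_attn = len(character) + 2
num_class_ctc = len(character) + 1
```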

Does CTC require a blank character?

Do I have to add a blank character to the alphabet for CTC to work properly, or can I use the same alphabet as the one used for the attention prediction?
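For reference, PyTorch's `CTCLoss` takes a configurable blank index, and the repo's CTCLabelConverter appears to reserve index 0 for the blank automatically, so the alphabet passed via `--character` should not need one. A minimal sketch of that convention (the mapping here is illustrative):

```python
import torch

# Assumed convention: index 0 is the CTC blank, so real characters map
# to indices 1..N and the class count is len(character) + 1.
character = "0123456789abcdefghijklmnopqrstuvwxyz"
char_to_idx = {c: i + 1 for i, c in enumerate(character)}  # 0 = blank
num_class = len(character) + 1
criterion = torch.nn.CTCLoss(blank=0, zero_infinity=True)
```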

Can't train the model with my own lmdb dataset

I have a problem training the model with my own lmdb dataset. I used create_lmdb_dataset.py with 1000 Vietnamese samples to create the database. When I train the model:
dataset_root: data/training
opt.select_data: ['ST']
opt.batch_ratio: ['0.5']

dataset_root: data/training dataset: ST
sub-directory: /ST num samples: 3
num total samples of ST: 3 x 1.0 (total_data_usage_ratio) = 3
num samples of ST per batch: 192 x 0.5 (batch_ratio) = 96

Total_batch_size: 96 = 96

Can you please tell me how to train on my own database? Thank you.
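Shrinking from 1000 samples to 3 suggests the dataset's label filter is silently dropping samples whose labels contain characters outside the default `--character` set (Vietnamese diacritics) or exceed `batch_max_length`. A rough sketch of that filtering logic (an assumption based on the symptoms; extending `--character` or passing `--data_filtering_off` would be the corresponding fixes):

```python
import re

# Illustrative reconstruction of the filter: drop labels that are too
# long or contain characters outside the configured alphabet.
def keeps(label, character, batch_max_length=25):
    if len(label) > batch_max_length:
        return False
    out_of_char = f"[^{re.escape(character)}]"
    return re.search(out_of_char, label.lower()) is None

print(keeps("xin chào", "abcdefghijklmnopqrstuvwxyz "))  # False: 'à' is out of charset
```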

Can the model be used for Chinese STR case retrained?

Hey, I just found that you've done a really fantastic job.
It seems to work for English, Korean & Japanese STR.
So, if I retrain the model with Chinese STR datasets (like RCTW, etc.), can it still work?
I don't know if you've tried such work. It would be great if I could get more suggestions.

Size mismatch when loading state_dict

RuntimeError: Error(s) in loading state_dict for DataParallel:
size mismatch for module.Prediction.attention_cell.rnn.weight_ih: copying a param with shape torch.Size([1024, 352]) from checkpoint, the shape in current model is torch.Size([1024, 294]).
size mismatch for module.Prediction.generator.bias: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([38]).
size mismatch for module.Prediction.generator.weight: copying a param with shape torch.Size([96, 256]) from checkpoint, the shape in current model is torch.Size([38, 256]).

Question about processing the SynthText .mat file

Could you provide the code to process the .mat file from http://www.robots.ox.ac.uk/~vgg/data/scenetext/?
There is no documentation for it, and I don't know how to deal with this file.

It is so weird:
the 'wordBB' of 8/ballet_106_0.jpg has 15 points. As we know, 8 points are enough to describe a box, so why does it provide 15?
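For reference, the SynthText gt.mat stores `wordBB` per image as a 2 × 4 × N array (x/y coordinates × 4 corners × N word boxes), which can look like a strange point count once axes are squeezed. A hedged sketch of unpacking it into per-word polygons (`word_polys` is an illustrative helper; the file itself would be read with `scipy.io.loadmat("gt.mat")`):

```python
import numpy as np

def word_polys(wordBB):
    """Turn a SynthText-style wordBB array into a list of (4, 2)
    corner arrays, one per word (assumed 2 x 4 x N layout; when
    N == 1 the trailing axis may have been squeezed to 2 x 4)."""
    arr = np.asarray(wordBB)
    if arr.ndim == 2:               # single word: (2, 4)
        arr = arr[:, :, None]
    return [arr[:, :, i].T for i in range(arr.shape[2])]

polys = word_polys(np.zeros((2, 4, 3)))  # three word boxes
```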

No bn layers in CRNN ?

Hi, I found that the CRNN network implemented in this project differs from the original one, which has a BN layer following each conv2d layer. I wonder why you removed them. Thanks.

Accuracy difference between local retraining model and pretrained one

First, thanks for your great work :) ! You've done a good job!

Here's my question, I've retrained the model with the option as:
"--select_data MJ-ST --batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC"
, corresponding to the original version of CRNN. The rest of the parameters are set to their defaults, and the model is trained on the MJ and ST datasets.

However, when testing with my local retrained best_accuracy model, the result accuracy is shown as below:
in IC13_857: only 88.45% while 91.1% in paper.
in IC13_1015: 87.68% while 89.2% in paper.
in IC15_1811: 66.37% while 69.4% in paper.
in IC15_2077: 64.07% while 64.2% in paper.

It seems there is still something inappropriate in my retraining process. Should I adjust the learning rate or extend the training iterations? Do you have any ideas for improving the performance to align with the public results illustrated in the paper?

I've also attempted to train only on the MJ dataset, which seems to give higher accuracy on IC13_857. When I extend the training to both MJ and ST, is it necessary to increase the iteration number to get better accuracy?

Looking forward to your reply ^_^

requires_grad

Hi,
In train.py and test.py there's a manual loop setting requires_grad=True/False. Is it really necessary? Why not use a with torch.no_grad() context for the whole validation part? On the other hand, such a context is used in test.validation(), but only for the input tensors, where it's unnecessary because tensors have requires_grad=False by default.
Sorry, I don't want to be too pedantic, in the end everything works fine, but just for the clarity :)

Regards,

strange accuracy in IC03

I found that when using Attn, the result is always better than with CTC, except for None+ResNet+BiLSTM tested on IC03. What is a possible interpretation of this result?

Inference code

Hi, bravo for the great findings!

Do you have any inference script?

Thanks.

Running train.py raises ValueError: num_samples should be a positive integer value, but got num_samples=0

dataset_root: ./result
opt.select_data: ['/']
opt.batch_ratio: ['1.0']

dataset_root: ./result dataset: /
sub-directory: /. num samples: 0
num total samples of /: 0 x 1.0 (total_data_usage_ratio) = 0
num samples of / per batch: 192 x 1.0 (batch_ratio) = 192
Traceback (most recent call last):
File "train.py", line 279, in
train(opt)
File "train.py", line 25, in train
train_dataset = Batch_Balanced_Dataset(opt)
File "/home/user/桌面/deep-text-recognition-benchmark-master/dataset.py", line 59, in __init__
collate_fn=_AlignCollate, pin_memory=True)
File "/home/user/anaconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 176, in __init__
sampler = RandomSampler(dataset)
File "/home/user/anaconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 66, in __init__
"value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

fixed image size

Does resizing to 32×100 matter when the text image has a large width-to-height ratio? For example, one image is 32×80 and another is 32×300. Do both go through _AlignCollate? Looking forward to your reply.
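For wide crops, a keep-ratio resize followed by right padding (roughly what the repo's `--PAD` option does) avoids squashing a 32×300 image into 32×100. A minimal sketch (`resize_keep_ratio` is an illustrative name, not a repo function):

```python
from PIL import Image

def resize_keep_ratio(img, imgH=32, imgW=100):
    """Scale to target height preserving aspect ratio, clamp the
    width to imgW, and pad the remainder on the right."""
    w, h = img.size
    new_w = min(imgW, max(1, int(w * imgH / h)))
    resized = img.resize((new_w, imgH), Image.BICUBIC)
    canvas = Image.new("L", (imgW, imgH), 0)   # black right padding
    canvas.paste(resized, (0, 0))
    return canvas

out = resize_keep_ratio(Image.new("L", (300, 32)))  # a wide crop
```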

training raises ZeroDivisionError: division by zero, and how to automatically generate the gt.txt file

Hi, how can I solve the ZeroDivisionError?
Is the error caused by the dataset being small?

Also, when creating an lmdb dataset,
how do I automatically generate the gt.txt file?


CUDA_VISIBLE_DEVICES=0 python train.py \

--train_data result_training_cp/ --valid_data result_validation_cp/ --select_data / --batch_ratio 1
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn


dataset_root: result_training_cp/
opt.select_data: ['/']
opt.batch_ratio: ['1']

dataset_root: result_training_cp/ dataset: /
sub-directory: /. num samples: 735
num total samples of /: 735 x 1.0 (total_data_usage_ratio) = 735
num samples of / per batch: 192 x 1.0 (batch_ratio) = 192

Total_batch_size: 192 = 192

dataset_root: result_validation_cp/ dataset: /
sub-directory: /. num samples: 203

model input parameters 32 100 20 1 512 256 38 25 TPS ResNet BiLSTM Attn
Skip Transformation.LocalizationNetwork.localization_fc2.weight as it is already initialized
Skip Transformation.LocalizationNetwork.localization_fc2.bias as it is already initialized
Model:
DataParallel(
(module): Model(
(Transformation): TPS_SpatialTransformerNetwork(
(LocalizationNetwork): LocalizationNetwork(
(conv): Sequential(
(0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace)
(3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(4): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(5): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): ReLU(inplace)
(7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(8): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(9): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(10): ReLU(inplace)
(11): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(12): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(13): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(14): ReLU(inplace)
(15): AdaptiveAvgPool2d(output_size=1)
)
(localization_fc1): Sequential(
(0): Linear(in_features=512, out_features=256, bias=True)
(1): ReLU(inplace)
)
(localization_fc2): Linear(in_features=256, out_features=40, bias=True)
)
(GridGenerator): GridGenerator()
)
(FeatureExtraction): ResNet_FeatureExtractor(
(ConvNet): ResNet(
(conv0_1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn0_1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv0_2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn0_2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(maxpool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(downsample): Sequential(
(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(maxpool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(layer2): Sequential(
(0): BasicBlock(
(conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(downsample): Sequential(
(0): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
)
)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(maxpool3): MaxPool2d(kernel_size=2, stride=(2, 1), padding=(0, 1), dilation=1, ceil_mode=False)
(layer3): Sequential(
(0): BasicBlock(
(conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
)
(2): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
)
(3): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
)
(4): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
)
)
(conv3): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(layer4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
)
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
)
(2): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
)
)
(conv4_1): Conv2d(512, 512, kernel_size=(2, 2), stride=(2, 1), padding=(0, 1), bias=False)
(bn4_1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv4_2): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1), bias=False)
(bn4_2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(AdaptiveAvgPool): AdaptiveAvgPool2d(output_size=(None, 1))
(SequenceModeling): Sequential(
(0): BidirectionalLSTM(
(rnn): LSTM(512, 256, batch_first=True, bidirectional=True)
(linear): Linear(in_features=512, out_features=256, bias=True)
)
(1): BidirectionalLSTM(
(rnn): LSTM(256, 256, batch_first=True, bidirectional=True)
(linear): Linear(in_features=512, out_features=256, bias=True)
)
)
(Prediction): Attention(
(attention_cell): AttentionCell(
(i2h): Linear(in_features=256, out_features=256, bias=False)
(h2h): Linear(in_features=256, out_features=256, bias=True)
(score): Linear(in_features=256, out_features=1, bias=False)
(rnn): LSTMCell(294, 256)
)
(generator): Linear(in_features=256, out_features=38, bias=True)
)
)
)
Trainable params num : 49555182
Optimizer:
Adadelta (
Parameter Group 0
eps: 1e-08
lr: 1
rho: 0.95
weight_decay: 0
)
------------ Options -------------
experiment_name: TPS-ResNet-BiLSTM-Attn-Seed1111
train_data: result_training_cp/
valid_data: result_validation_cp/
manualSeed: 1111
workers: 4
batch_size: 192
num_iter: 300000
valInterval: 2000
continue_model:
adam: False
lr: 1
beta1: 0.9
rho: 0.95
eps: 1e-08
grad_clip: 5
select_data: ['/']
batch_ratio: ['1']
total_data_usage_ratio: 1.0
batch_max_length: 25
imgH: 32
imgW: 100
rgb: False
character: 0123456789abcdefghijklmnopqrstuvwxyz
sensitive: False
PAD: False
Transformation: TPS
FeatureExtraction: ResNet
SequenceModeling: BiLSTM
Prediction: Attn
num_fiducial: 20
input_channel: 1
output_channel: 512
hidden_size: 256
num_gpu: 1
num_class: 38

[0/300000] Loss: 3.74243 elapsed_time: 15.06362
Traceback (most recent call last):
File "train.py", line 278, in
train(opt)
File "train.py", line 160, in train
model, criterion, valid_loader, converter, opt)
File "/home/cloudera1/deep/yc_reco/yc_reco/test.py", line 129, in validation
norm_ED += edit_distance(pred, gt) / len(gt)
ZeroDivisionError: division by zero
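The failing line in test.py divides by `len(gt)`, so the error fires whenever a ground-truth label is empty (e.g. a blank entry in gt.txt). A minimal guard (`safe_norm_ed` and the stand-in metric are illustrative, not repo code; the worst/best-case values for empty ground truth are an assumption):

```python
def safe_norm_ed(pred, gt, edit_distance):
    """Normalized edit distance that tolerates empty ground truth."""
    if len(gt) == 0:
        return 1.0 if len(pred) > 0 else 0.0  # assumed worst/best case
    return edit_distance(pred, gt) / len(gt)

dist = lambda a, b: abs(len(a) - len(b))  # stand-in metric for the sketch
print(safe_norm_ed("abc", "", dist))
```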

Can't seem to recognize numbers

Hi, I was testing with number images, but they are misrecognized as letters. Can you provide a model that also recognizes numbers?
Thanks a lot

quadrangle / rotated bbox

Hi,
Thank you for this repo and great survey of current state-of-art methods.
I wonder what your approach was for annotations like ICDAR, which come as quadrangles (usually a rotated bbox). Did you de-rotate them first, or just use the larger axis-aligned bbox?
On the other hand, modern detection methods can predict rotation or even masks. Does that make the spatial transformation in the recognition step unnecessary?

Regards,

Thoughts about generated/real images ratio

I am training your model on custom data and I am using transfer learning for the feature extraction part with Resnet-152.

To train the model I use synthetic data, since what I will recognize has a limited range, and I also use real data. I would like to know what a good synthetic/real data ratio is, and when I can consider that my model has seen enough synthetic data to start training on the real data.

Best

Why do errors occur in create_lmdb_dataset?

I created a Korean lmdb dataset with 9000 samples, but I have a problem.
I have already set opt.character in train.py:
parser.add_argument('--character', type=str, default='가각간갇갈감갑값갓강갖같갚갛개객

error occured 7
error occured 225
error occured 523
error occured 760
error occured 846
Written 1000 / 9000
...
error occured 8210
error occured 8289
error occured 8345
error occured 8497
error occured 8807
error occured 8899
error occured 8973
Created dataset with 8937 samples

somebody help me!

GPU OOM when using the demo script

Using the demo script with many images (for example, 2000 images)
raises a GPU out-of-memory error.
Can anyone help solve this problem?
Thanks
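One generic way to bound demo-time memory (a sketch, not repo code; `predict_all` and the stand-in model are illustrative) is to run inference in fixed-size batches under `torch.no_grad()` instead of pushing all crops through at once:

```python
import torch

def predict_all(model, images, batch_size=64):
    """Run inference over a list of image tensors in small batches,
    keeping peak GPU memory proportional to batch_size."""
    outputs = []
    model.eval()
    with torch.no_grad():                      # no autograd buffers kept
        for i in range(0, len(images), batch_size):
            batch = torch.stack(images[i:i + batch_size])
            outputs.append(model(batch))
    return torch.cat(outputs)

model = torch.nn.Flatten()                     # stand-in for a recognizer
preds = predict_all(model, [torch.zeros(1, 32, 100)] * 10, batch_size=4)
```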

Error during training

When I use ST_spe and the validation set from https://drive.google.com/drive/folders/192UfE9agQUMNq6AgU3_E05_FcPZK4hyt

CUDA_VISIBLE_DEVICES=2 python3 train.py --train_data data/ST_spe --valid_data data/validation --select_data / --batch_ratio 1  --Transformation TPS --sensitive  --workers 0  --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn

The training seems to be stuck

[0/300000] Loss: 4.57599 elapsed_time: 3.90644
TTkkkkkkk//////////////, gt: SANDS               ,   False
T7777777777777777777777, gt: PULL                ,   False
TTTTeeeeTeTeTeTeTeTeTeT, gt: DIRECTION           ,   False
TTTTTTTTTTTTTTTTTTTTTTT, gt: ALTERNATE           ,   False
TT77`7`7`777777,777,,77, gt: FOR                 ,   False
[0/300000] valid loss: 4.65960 accuracy: 0.000, norm_ED: 36773.78
best_accuracy: 0.000, best_norm_ED: 36773.78

log:

------------ Options -------------
experiment_name: TPS-ResNet-BiLSTM-Attn-Seed1111
train_data: data/ST_spe
valid_data: data/validation
manualSeed: 1111
workers: 0
batch_size: 64
num_iter: 300000
valInterval: 2000
continue_model: 
adam: False
lr: 1
beta1: 0.9
rho: 0.95
eps: 1e-08
grad_clip: 5
select_data: ['/']
batch_ratio: ['1']
total_data_usage_ratio: 1.0
batch_max_length: 23
imgH: 32
imgW: 280
rgb: False
character: 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
sensitive: True
PAD: False
data_filtering_off: False
Transformation: TPS
FeatureExtraction: ResNet
SequenceModeling: BiLSTM
Prediction: Attn
num_fiducial: 20
input_channel: 1
output_channel: 512
hidden_size: 256
num_gpu: 1
num_class: 96
---------------------------------------

Thank you for your reply.

Results different with online Demo

Hi,
Thanks a lot for this great effort.

However, using the provided pretrained model, the outputs inferred locally differ greatly from those of the demo site (https://demo.ocr.clova.ai).

The online demo recognizes the text, but local inference using the provided TPS-ResNet-BiLSTM-Attn.pth fails miserably.

Can you please elaborate on any extra steps taken for the demo-site model, or any other pre-processing/post-processing steps involved?

Testimages

Train --sensitive mode with the MJSynth + ST_spe

Hi,
Great work from you, and thanks for the sharing.
I trained the model in --sensitive mode with MJSynth + ST_spe. Here are the training accuracy and loss:
[79000/300000] Loss: 0.09761 elapsed_time: 103154.51228
FOR, , gt: FOR , False
Mo-vediplon , gt: MOVCD1PLO1 , False
REBATE , gt: REBATE , True
WALK , gt: WALK , True
NISSAN-SBI , gt: NISSANSBI , False
[79000/300000] valid loss: 2.17037 accuracy: 58.939, norm_ED: 2006.77
best_accuracy: 60.812, best_norm_ED: 1922.32

**The model gives constant accuracy from this iteration on:**

[47500/300000] valid loss: 2.29950 accuracy: 56.822, norm_ED: 2124.85
best_accuracy: 60.812, best_norm_ED: 1922.32
[48000/300000] Loss: 0.11939 elapsed_time: 62696.96234
proudly , gt: proudly , True
LOVE , gt: LOVE , True
SAXONS , gt: SAXONS , True
TESCO , gt: TESCO , True

------------ Options -------------
experiment_name: TPS-ResNet-BiLSTM-Attn-Seed1111
train_data: data_lmdb_release/training
valid_data: data_lmdb_release/validation
manualSeed: 1111
workers: 4
batch_size: 192
num_iter: 300000
valInterval: 500
continue_model:
adam: False
lr: 1
beta1: 0.9
rho: 0.95
eps: 1e-08
grad_clip: 5
select_data: ['MJ', 'ST_spe']
batch_ratio: ['0.5', '0.5']
total_data_usage_ratio: 1.0
batch_max_length: 25
imgH: 32
imgW: 100
rgb: False
character: 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[]^_`{|}~"
sensitive: True
PAD: False
data_filtering_off: False
Transformation: TPS
FeatureExtraction: ResNet
SequenceModeling: BiLSTM
Prediction: Attn
num_fiducial: 20
input_channel: 1
output_channel: 512
hidden_size: 256
num_gpu: 1
num_class: 96

Should I stop training ...
What should I do to get good accuracy?

Multi GPU training

Hi,
How can I train with multiple GPUs in this project? I tried setting CUDA_VISIBLE_DEVICES=0,1, but it doesn't work.
Best
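For reference, train.py already wraps the model in `nn.DataParallel`, so multi-GPU training mostly comes down to exposing the devices before CUDA is first used and letting `device_count()` pick them up (a sketch under that assumption; the environment handling depends on your setup, and `batch_size` is split across the visible GPUs):

```python
import os
import torch
import torch.nn as nn

# Must be set before the first CUDA call for it to take effect.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

num_gpu = torch.cuda.device_count()   # 2 if both GPUs are visible
model = nn.Linear(8, 8)               # stand-in for the recognizer
if num_gpu > 1:
    model = nn.DataParallel(model)    # splits each batch across GPUs
```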

Problem with attention

I have a question about the attention code: the hidden state is always initialized to zeros. In my opinion it should be an encoded hidden state of batch_H from an LSTM, and I can't see from your code what function the rnn_cell serves.
