clovaai / deep-text-recognition-benchmark
Text recognition (optical character recognition) with deep learning methods, ICCV 2019
License: Apache License 2.0
model input parameters 32 100 20 1 512 256 96 25 TPS ResNet BiLSTM Attn
loading pretrained model from /home1/zy/STR/psenet_benchmark/benchmark/pretrained_model/TPS-ResNet-BiLSTM-Attn-case-sensitive.pth
Traceback (most recent call last):
File "/home1/zy/STR/psenet_benchmark/api.py", line 22, in <module>
br(image_folder)
File "/home1/zy/STR/psenet_benchmark/api.py", line 15, in br
getrecognition(image_folder)
File "/home1/zy/STR/psenet_benchmark/benchmark/getrecognition.py", line 121, in getrecognition
demo(opt)
File "/home1/zy/STR/psenet_benchmark/benchmark/getrecognition.py", line 64, in demo
preds = model(image, text_for_pred, is_train=False)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "./benchmark/model.py", line 82, in forward
contextual_feature = self.SequenceModeling(visual_feature)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "./benchmark/modules/sequence_modeling.py", line 16, in forward
self.rnn.flatten_parameters()
File "/home1/zy/miniconda3/envs/STR/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: set_storage is not allowed on Tensor created from .data or .detach()
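A workaround that reportedly avoids this error (an assumption on my part, not an official fix) is to drop the flatten_parameters() call in benchmark/modules/sequence_modeling.py; the call only compacts the LSTM weights in GPU memory, so removing it costs at most a performance warning. A minimal sketch of the patched module:

```python
import torch.nn as nn

class BidirectionalLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(BidirectionalLSTM, self).__init__()
        self.rnn = nn.LSTM(input_size, hidden_size,
                           bidirectional=True, batch_first=True)
        self.linear = nn.Linear(hidden_size * 2, output_size)

    def forward(self, input):
        # self.rnn.flatten_parameters()  # removed: raises "set_storage is not
        # allowed ..." under DataParallel on this PyTorch version (assumption)
        recurrent, _ = self.rnn(input)   # (batch, T, 2*hidden_size)
        output = self.linear(recurrent)  # (batch, T, output_size)
        return output
```

Alternatively, upgrading PyTorch beyond 1.1 is said to resolve the underlying DataParallel issue.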
Hi, thank you for sharing such great work! I am a novice and found something I cannot understand:
The AttnLabelConverter encodes the labels into text, length. The text has shape [batch_size, max_length+2], where the '2' accounts for the [GO] and [stop] tokens.
When the text is passed into Attention.forward(), however, the docstring says: text: the text-index of each image. [batch_size x (max_length+1)]. +1 for [GO] token. text[:, 0] = [GO].
Are these two text tensors the same object? If so, why are the shapes different?
Thanks for your help!
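They appear to be the same tensor at different points in the pipeline: the converter emits [GO] + characters + [s] (max_length+2 columns), while the decoder input drops the last column. A toy illustration (numpy, not repo code; the text[:, :-1] / text[:, 1:] split is an assumption based on the repo's training loop):

```python
import numpy as np

batch_size, batch_max_length = 2, 5
# AttnLabelConverter output: [GO] + up to batch_max_length chars + [s]
text = np.zeros((batch_size, batch_max_length + 2), dtype=np.int64)

decoder_input = text[:, :-1]  # (batch, max_length+1): starts with [GO]
target = text[:, 1:]          # (batch, max_length+1): ends with [s]
print(decoder_input.shape, target.shape)
```

So Attention.forward() sees max_length+1 columns even though the converter produced max_length+2.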
I added a simple test image to the demo folder, and it is recognized incorrectly.
The image:
The output:
128456782012
The command I used is the same as in the README.
Thinking that the problem could have been related to the string length, I tried retraining the TPS-ResNet-BiLSTM-Attn model using an imgW of 200 pixels, but the problem seems to be very similar.
Any idea on why this happens? It seems to me that this image is much simpler compared to the other demo images.
I added some characters and fine-tuned the pretrained model on my dataset with the parameters TPS, ResNet, BiLSTM, Attn, sensitive, adam. The loss converged quickly, but the accuracy always stays at zero.
However, the accuracy grows when I set the optimizer to Adadelta.
Could you give me some advice?
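One plausible explanation (an assumption, not a confirmed diagnosis): the training script's default lr=1 is tuned for Adadelta, and Adam usually diverges at that rate, which would match a quickly "converging" loss alongside zero accuracy. A hedged sketch of passing a smaller learning rate when switching to Adam:

```python
import torch

# A dummy parameter list stands in for model.parameters() here.
params = [torch.nn.Parameter(torch.zeros(4))]

# Assumption: with --adam, something like lr=0.001 is a more typical starting
# point than the Adadelta-oriented default of lr=1.
optimizer = torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999))
print(optimizer.defaults['lr'])
```

On the command line this would correspond to adding something like `--adam --lr 0.001`.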
Hi folks,
You did a great job of comparing and contrasting the effects of different modules on text recognition! I'm currently trying to train a fast model, so I'm training ResNet + CTC and VGG + CTC on my own. Using the default settings from your training script, I wonder how long the training takes to reach ~70% accuracy as shown in Table 8?
By the way, have you tried using MobileNet-V2 as the backbone? With the provided settings, I can't get MobileNet-V2 + CTC past ~40% accuracy.
After training my model, I'm having trouble testing it on images. I'm getting this error message:
RuntimeError: Error(s) in loading state_dict for DataParallel:
Missing key(s) in state_dict: "module.Prediction.attention_cell.i2h.weight", "module.Prediction.attention_cell.h2h.weight", "module.Prediction.attention_cell.h2h.bias", "module.Prediction.attention_cell.score.weight", "module.Prediction.attention_cell.rnn.weight_ih", "module.Prediction.attention_cell.rnn.weight_hh", "module.Prediction.attention_cell.rnn.bias_ih", "module.Prediction.attention_cell.rnn.bias_hh", "module.Prediction.generator.weight", "module.Prediction.generator.bias".
Unexpected key(s) in state_dict: "module.Prediction.weight", "module.Prediction.bias".
Any advice on this would be helpful.
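The missing keys are the attention decoder's parameters, while the unexpected module.Prediction.weight/bias pair looks like a single Linear layer, i.e. a CTC head; this suggests the checkpoint was trained with --Prediction CTC but loaded into a model built with --Prediction Attn (this reading is an assumption). A small helper to inspect a checkpoint before choosing the test-time flags:

```python
def head_type(state_dict):
    """Guess which --Prediction head a checkpoint was trained with.

    Assumption: a CTC head is a single Linear layer (weight/bias directly
    under module.Prediction), while an Attn head stores a generator plus an
    attention cell under module.Prediction.
    """
    if 'module.Prediction.weight' in state_dict:
        return 'CTC'
    if 'module.Prediction.generator.weight' in state_dict:
        return 'Attn'
    return 'unknown'

# Usage (hypothetical path):
#   head_type(torch.load('saved_model.pth', map_location='cpu'))
```

The test-time --Prediction flag (and the other module flags) must match the ones used for training.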
Once this model is trained, when feeding an image through demo.py I can only get a single word as output at best. How can I change the code, or what am I doing wrong, if I want to use this to recognize images with many words on them?
First of all, thank you so much for such wonderful work! I tried to test some vertical text, but it cannot detect any character. Does it need training specifically for vertical text, or does the model have some limitation?
Hi Author,
Great work from you, and thanks for the sharing.
I noted that the accuracy of your best model on IC13 is 93.6% in the paper, while it is 95.98% on the Robust Reading Competition website.
Could you please explain this difference?
Thanks.
Hello, I cannot open the dataset link. Could you provide create_dataset.py so that I can create my own training dataset?
Hi, can you provide the other models, for example the one with CTC instead of Attn?
Should the training images' width be fixed?
My dataset consists of images containing 14 numeric characters (for example). I configured my options as below:
character: 0123456789-
sensitive: False
PAD: True
data_filtering_off: False
Transformation: TPS
FeatureExtraction: ResNet
SequenceModeling: BiLSTM
Prediction: Attn
num_fiducial: 20
input_channel: 3
output_channel: 512
hidden_size: 256
num_gpu: 1
num_class: 13
but after training for 200,000 steps, the accuracy is still 0%.
Do I have to add a blank character to the alphabet for CTC to work properly? Or can I use the same alphabet as for the attention prediction?
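As far as I can tell from utils.py, you do not add the blank yourself: CTCLabelConverter reserves index 0 for it internally, so the same --character alphabet can serve both CTC and Attn. A hedged sketch of that indexing scheme (names are illustrative, not repo code):

```python
def build_ctc_dict(character):
    """Sketch of CTC label indexing: index 0 is reserved for the blank,
    so the user-supplied alphabet starts at index 1 (assumption based on
    the repo's CTCLabelConverter)."""
    dict_character = list(character)
    char_to_index = {}
    for i, char in enumerate(dict_character):
        char_to_index[char] = i + 1          # 0 is the CTC blank
    index_to_char = ['[CTCblank]'] + dict_character
    return char_to_index, index_to_char

char_to_index, index_to_char = build_ctc_dict('0123456789-')
print(index_to_char[0], char_to_index['0'])
```

So an 11-character alphabet yields 12 CTC classes, without a blank symbol in --character itself.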
Will you release the best models?
Total_batch_size: 96 = 96
Can you please tell me how to train on my own database? Thank you.
Hey, I just found that you've done a really fantastic job.
Seems like it works on English, Korean & Japanese STR.
So... if I retrain the model with Chinese STR datasets (like RCTW, etc.), can it still work?
I don't know if you've tried such work. It'll be great if I could get more suggestions.
RuntimeError: Error(s) in loading state_dict for DataParallel:
size mismatch for module.Prediction.attention_cell.rnn.weight_ih: copying a param with shape torch.Size([1024, 352]) from checkpoint, the shape in current model is torch.Size([1024, 294]).
size mismatch for module.Prediction.generator.bias: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([38]).
size mismatch for module.Prediction.generator.weight: copying a param with shape torch.Size([96, 256]) from checkpoint, the shape in current model is torch.Size([38, 256]).
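The shapes suggest a character-set mismatch rather than a corrupt file: 96 and 38 output classes correspond to the case-sensitive and default alphabets respectively, each plus the two Attn tokens [GO] and [s] (my reading; worth checking against the training flags). The arithmetic, using the charsets train.py appears to use:

```python
import string

# Default alphabet: digits + lowercase letters.
default_charset = '0123456789abcdefghijklmnopqrstuvwxyz'

# With --sensitive, train.py uses string.printable[:-6], i.e. the 94
# printable ASCII characters without the whitespace tail (assumption).
sensitive_charset = string.printable[:-6]

# Attn adds two tokens ([GO], [s]) on top of the alphabet.
print(len(default_charset) + 2, len(sensitive_charset) + 2)
```

So a checkpoint with 96-class weights needs the same --sensitive flag (and alphabet) at load time.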
I want to check the results using Korean images.
If it supports Korean, how can I get the results in Korean?
I don't know if you can provide code to process the .mat file from http://www.robots.ox.ac.uk/~vgg/data/scenetext/.
There is no documentation for it, and I don't know how to handle this file.
It is so weird: the 'wordBB' of 8/ballet_106_0.jpg has 15 points. As we know, 8 points are enough to describe a box, so why does it provide 15 points?
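For what it's worth, gt.mat can be read with scipy.io.loadmat, and wordBB stores all word boxes of an image in a single (2, 4, n_words) array (x/y rows, four corners, one slice per word), which can look like "extra points" when printed flat. A hedged helper, assuming that layout:

```python
import numpy as np

def word_boxes(gt, i):
    """Return an (n_words, 4, 2) array of word quadrilaterals for image i.

    Assumes the SynthText layout: gt['wordBB'][0][i] has shape
    (2, 4, n_words); an image with a single word may come back squeezed
    to (2, 4), so we restore the third axis.
    """
    bb = gt['wordBB'][0][i]
    if bb.ndim == 2:                  # single word: (2, 4) -> (2, 4, 1)
        bb = bb[:, :, np.newaxis]
    return bb.transpose(2, 1, 0)      # (n_words, 4, 2): per-word corner lists

# Usage (hypothetical):
#   from scipy.io import loadmat
#   gt = loadmat('gt.mat')   # keys include 'imnames', 'wordBB', 'charBB', 'txt'
#   boxes = word_boxes(gt, 0)
```

So a "15-point" wordBB most likely just means the image contains many words, not that one box has 15 corners.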
Hi, I found that the CRNN network implemented in this project differs from the original one, which has a BN layer following each conv2d layer. I wonder why you removed them. Thanks.
First, thanks for your great work :) ! You've done a good job!
Here's my question, I've retrained the model with the option as:
"--select_data MJ-ST --batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC"
, corresponding to the original version of CRNN. The remaining parameters are set to their defaults, and the model is trained on the MJ and ST datasets.
However, when testing with my local retrained best_accuracy model, the result accuracy is shown as below:
in IC13_857: only 88.45% while 91.1% in paper.
in IC13_1015: 87.68% while 89.2% in paper.
in IC15_1811: 66.37% while 69.4% in paper.
in IC15_2077: 64.07% while 64.2% in paper.
It seems there is still something inappropriate in my retraining process. Should I reset the learning rate or extend the training iterations? Do you have any ideas for improving the performance to align with the published results in the paper?
I've also tried training only on the MJ dataset, and that model seems to have higher accuracy on IC13_857. When I extend training to both MJ and ST, is it necessary to increase the iteration number to get better accuracy?
Looking forward to your reply ^_^
Hi,
In train.py and test.py there is a manual loop setting requires_grad=True/False. Is it really necessary? Why not use a with torch.no_grad() context for the whole validation part? On the other hand, such a context is used in test.validation(), but only for the input tensors, where it is unnecessary, because tensors have requires_grad=False by default.
Sorry, I don't want to be too pedantic, in the end everything works fine, but just for the clarity :)
Regards,
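For reference, the suggestion above could look like this minimal sketch (a generic validation loop, not the repo's test.validation()):

```python
import torch

def validate(model, loader, criterion, device='cpu'):
    """Run a validation pass with autograd disabled for everything inside,
    instead of toggling requires_grad flags by hand."""
    model.eval()
    total_loss = 0.0
    with torch.no_grad():   # no graph is built for any op in this block
        for images, targets in loader:
            preds = model(images.to(device))
            total_loss += criterion(preds, targets.to(device)).item()
    model.train()
    return total_loss / max(len(loader), 1)
```

Wrapping the whole loop this way also saves memory, since no intermediate activations are kept for backprop.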
I found that when using Attn, the result is always better than using CTC, except for None+ResNet+BiLSTM tested on IC03. What is a possible interpretation of this result?
Hi, bravo for the great findings!
Do you have any inference script?
Thanks.
dataset_root: ./result dataset: /
sub-directory: /. num samples: 0
num total samples of /: 0 x 1.0 (total_data_usage_ratio) = 0
num samples of / per batch: 192 x 1.0 (batch_ratio) = 192
Traceback (most recent call last):
File "train.py", line 279, in <module>
train(opt)
File "train.py", line 25, in train
train_dataset = Batch_Balanced_Dataset(opt)
File "/home/user/桌面/deep-text-recognition-benchmark-master/dataset.py", line 59, in __init__
collate_fn=_AlignCollate, pin_memory=True)
File "/home/user/anaconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 176, in __init__
sampler = RandomSampler(dataset)
File "/home/user/anaconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 66, in __init__
"value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
Does resizing to 32×100 matter if the text image has a large width-to-height ratio? For example, one image is 32×80 and another is 32×300. Does this also go through _AlignCollate? Looking forward to your reply.
Hi, how can I solve a ZeroDivisionError? Is the error caused by a small dataset?
Also, when creating an lmdb dataset, how can I automatically generate the gt.txt file?
CUDA_VISIBLE_DEVICES=0 python train.py \
--train_data result_training_cp/ --valid_data result_validation_cp/ --select_data / --batch_ratio 1 \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn
[0/300000] Loss: 3.74243 elapsed_time: 15.06362
Traceback (most recent call last):
File "train.py", line 278, in <module>
train(opt)
File "train.py", line 160, in train
model, criterion, valid_loader, converter, opt)
File "/home/cloudera1/deep/yc_reco/yc_reco/test.py", line 129, in validation
norm_ED += edit_distance(pred, gt) / len(gt)
ZeroDivisionError: division by zero
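A defensive guard (my sketch, not the repo's fix) would handle empty ground-truth labels, which make len(gt) zero, for instance when an empty label slips through data filtering:

```python
def norm_edit_distance(pred, gt, edit_distance):
    """Normalized edit distance that tolerates an empty ground-truth string.

    The fallback values for an empty gt are an assumption: any non-empty
    prediction counts as fully wrong (1.0), an empty one as fully right (0.0).
    The edit_distance function is passed in (e.g. nltk's edit_distance).
    """
    if len(gt) == 0:
        return 1.0 if len(pred) > 0 else 0.0
    return edit_distance(pred, gt) / len(gt)
```

It is also worth checking the dataset for empty labels, since they usually indicate a labeling or filtering problem rather than a metric problem.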
Hi, I was testing with number images, but they are misrecognized as letters. Can you provide a model that recognizes numbers too?
Thanks a lot
Hi,
Thank you for this repo and great survey of current state-of-art methods.
I wonder what your approach was for annotations like ICDAR's, which come as quadrangles (usually a rotated bbox). Did you de-rotate them first, or just use a larger axis-aligned bbox?
On the other hand, modern detection methods are able to predict rotation or even masks. Is the spatial transformation in the recognition step then unnecessary?
Regards,
Hi,
I have been trying to run the code on these two images:
I get correct results on the demo website https://demo.ocr.clova.ai/,
This is the result I get through the offline code:
First Image: he11505973
Second Image: hijo86
I am training your model on custom data and I am using transfer learning for the feature extraction part with Resnet-152.
To train the model I use synthetic data, since what I will recognize has a limited range, and I also use real data. I would like to know what a good ratio of synthetic to real data is, and when I can consider that my model has seen enough synthetic data to train on the real data.
Best
I created a Korean lmdb dataset with 9000 samples, but I have a problem.
I already set opt.character in train.py:
parser.add_argument('--character', type=str, default='가각간갇갈감갑값갓강갖같갚갛개객
error occured 7
error occured 225
error occured 523
error occured 760
error occured 846
Written 1000 / 9000
...
error occured 8210
error occured 8289
error occured 8345
error occured 8497
error occured 8807
error occured 8899
error occured 8973
Created dataset with 8937 samples
Can somebody help me?
When using the demo script with a larger number of images, e.g. 2000 images, it raises a GPU OOM error.
Can anyone help solve this problem?
Thanks.
When I use ST_spe and the validation set from https://drive.google.com/drive/folders/192UfE9agQUMNq6AgU3_E05_FcPZK4hyt
CUDA_VISIBLE_DEVICES=2 python3 train.py --train_data data/ST_spe --valid_data data/validation --select_data / --batch_ratio 1 --Transformation TPS --sensitive --workers 0 --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn
The training seems to be stuck
[0/300000] Loss: 4.57599 elapsed_time: 3.90644
TTkkkkkkk//////////////, gt: SANDS , False
T7777777777777777777777, gt: PULL , False
TTTTeeeeTeTeTeTeTeTeTeT, gt: DIRECTION , False
TTTTTTTTTTTTTTTTTTTTTTT, gt: ALTERNATE , False
TT77`7`7`777777,777,,77, gt: FOR , False
[0/300000] valid loss: 4.65960 accuracy: 0.000, norm_ED: 36773.78
best_accuracy: 0.000, best_norm_ED: 36773.78
log:
------------ Options -------------
experiment_name: TPS-ResNet-BiLSTM-Attn-Seed1111
train_data: data/ST_spe
valid_data: data/validation
manualSeed: 1111
workers: 0
batch_size: 64
num_iter: 300000
valInterval: 2000
continue_model:
adam: False
lr: 1
beta1: 0.9
rho: 0.95
eps: 1e-08
grad_clip: 5
select_data: ['/']
batch_ratio: ['1']
total_data_usage_ratio: 1.0
batch_max_length: 23
imgH: 32
imgW: 280
rgb: False
character: 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
sensitive: True
PAD: False
data_filtering_off: False
Transformation: TPS
FeatureExtraction: ResNet
SequenceModeling: BiLSTM
Prediction: Attn
num_fiducial: 20
input_channel: 1
output_channel: 512
hidden_size: 256
num_gpu: 1
num_class: 96
---------------------------------------
Thank you for your reply.
Hi,
Thanks a lot for this great effort.
However, using the provided pretrained model, the results differ greatly from the outputs inferred on the demo site (https://demo.ocr.clova.ai).
The online demo recognizes the text, but local inference using the provided TPS-ResNet-BiLSTM-Attn.pth fails badly.
Can you please elaborate on any extra steps taken for the demo-site model, or on any other pre-processing/post-processing steps involved?
Does the model support variable-length recognition? I trained the model on Chinese STR datasets but get bad accuracy.
I use TPS-ResNet and VGG-BiLSTM-Attn.
What should I do to get normal accuracy?
Hi,
Great work from you, and thanks for the sharing.
I trained the model in --sensitive mode with MJSynth + ST_spe. Here are the training accuracy and loss:
[79000/300000] Loss: 0.09761 elapsed_time: 103154.51228
FOR, , gt: FOR , False
Mo-vediplon , gt: MOVCD1PLO1 , False
REBATE , gt: REBATE , True
WALK , gt: WALK , True
NISSAN-SBI , gt: NISSANSBI , False
[79000/300000] valid loss: 2.17037 accuracy: 58.939, norm_ED: 2006.77
best_accuracy: 60.812, best_norm_ED: 1922.32
**Model gives constant accuracy from this iteration:**
[47500/300000] valid loss: 2.29950 accuracy: 56.822, norm_ED: 2124.85
best_accuracy: 60.812, best_norm_ED: 1922.32
[48000/300000] Loss: 0.11939 elapsed_time: 62696.96234
proudly , gt: proudly , True
LOVE , gt: LOVE , True
SAXONS , gt: SAXONS , True
TESCO , gt: TESCO , True
Should I stop training?
What should I do to get good accuracy?
How can we make this code compatible with CUDA 10?
Hi,
how can I train using multiple GPUs in this project? I tried setting CUDA_VISIBLE_DEVICES=0,1, but it fails.
best
I have a question about the attention code: the hidden state is always initialized to zeros, but I think it should be an encoded hidden state of batch_H from the LSTM. Also, I can't see what function the rnn_cell serves in your code.
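For context, here is how I read the attention step (a sketch mirroring the repo's AttentionCell, with my own comments; treat the details as assumptions). The zero initial hidden state is a standard choice because the decoder gathers information from batch_H through the attention weights at every step, and the rnn_cell is what carries the decoder's own state from one character to the next:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCellSketch(nn.Module):
    def __init__(self, input_size, hidden_size, num_embeddings):
        super().__init__()
        self.i2h = nn.Linear(input_size, hidden_size, bias=False)
        self.h2h = nn.Linear(hidden_size, hidden_size)
        self.score = nn.Linear(hidden_size, 1, bias=False)
        self.rnn = nn.LSTMCell(input_size + num_embeddings, hidden_size)

    def forward(self, prev_hidden, batch_H, char_onehots):
        # attention scores over encoder steps, conditioned on the decoder's
        # previous hidden state (zeros at the first step)
        e = self.score(torch.tanh(self.i2h(batch_H)
                                  + self.h2h(prev_hidden[0]).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)             # (B, T, 1)
        context = (alpha * batch_H).sum(dim=1)  # weighted sum of batch_H
        # the LSTMCell advances the decoder state character by character
        hidden = self.rnn(torch.cat([context, char_onehots], 1), prev_hidden)
        return hidden, alpha
```

So batch_H is consulted through alpha at every step rather than injected once as the initial state.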
Are the best models case-insensitive? Can you provide case-sensitive best models?
Can you please tell me how to convert the model to ONNX? Thank you.
How can I train and test these models on my own data?
I'm getting this error when trying to train on default data, any information is helpful!