tesseract-ocr / tessdata_fast
Fast integer versions of trained LSTM models
License: Apache License 2.0
Hi,
I see "network" in some of your descriptions. I'm writing an app that runs offline, with no network access. Can Tesseract 4.0 be used offline?
I maintain the Fedora tesseract-tessdata package. I'm unsure how to deal with the tessconfigs folder when packaging. The release archive [1] contains the empty submodule folder and some broken symlinks that point to files inside the submodule. Is it safe to just ignore the folder and these files? Is it worth pulling the tessconfigs folder separately and packaging it?
[1] https://github.com/tesseract-ocr/tessdata_fast/archive/4.1.0/tessdata_fast-4.1.0.tar.gz
On Windows 7, I am trying to use the fast traineddata files in my Java project, but I get "Invalid memory access" even after setting the datapath.
I have tried the best data files, but they give the same error. The default data files work, but they are huge, so I was going for the fast files.
Tesseract tess = new Tesseract();
tess.setDatapath("C:\\Users\\U6070534\\Downloads\\tess4j\\tessdata");
tess.setLanguage("eng");
String inputFilePath = "C:\\Users\\U6070534\\IdeaProjects\\ocrsample\\screenshot\\craft0.png";
try {
    textpath.add(tess.doOCR(new File(inputFilePath)));
} catch (TesseractException e1) {
    e1.printStackTrace();
}
Exception in thread "main" java.lang.Error: Invalid memory access
at com.sun.jna.Native.invokePointer(Native Method)
at com.sun.jna.Function.invokePointer(Function.java:470)
at com.sun.jna.Function.invoke(Function.java:404)
at com.sun.jna.Function.invoke(Function.java:315)
at com.sun.jna.Library$Handler.invoke(Library.java:212)
at com.sun.proxy.$Proxy0.TessBaseAPIGetUTF8Text(Unknown Source)
at net.sourceforge.tess4j.Tesseract.getOCRText(Tesseract.java:437)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:292)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:213)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:197)
at OcrReader.main(OcrReader.java:25)
Failed loading language 'eng'
Tesseract couldn't load any languages!
Process finished with exit code 1
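The "Failed loading language 'eng'" line in the log above says the engine could not load eng from the configured datapath. One cheap check before calling doOCR is to confirm that the expected file is actually there. This is a minimal sketch using only the Java standard library. It assumes the datapath directory holds files named <lang>.traineddata directly. The class and method names (TessdataCheck, hasLanguage) are hypothetical and not part of Tess4J. It cannot tell whether a file that is present is compatible with the installed engine version.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tess4J or Tesseract.
class TessdataCheck {
    // Returns true if <lang>.traineddata exists as a regular file in dataDir.
    // Assumption: traineddata files sit directly inside the datapath directory.
    static boolean hasLanguage(Path dataDir, String lang) {
        return Files.isRegularFile(dataDir.resolve(lang + ".traineddata"));
    }

    public static void main(String[] args) {
        // Defaults are placeholders; pass the real datapath and language as arguments.
        Path dataDir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        String lang = args.length > 1 ? args[1] : "eng";
        if (hasLanguage(dataDir, lang)) {
            System.out.println("Found " + lang + ".traineddata in " + dataDir);
        } else {
            System.out.println("Missing " + lang + ".traineddata in " + dataDir);
        }
    }
}
```

If the file is reported missing, the datapath probably points at the wrong directory. If it is found, the failure has some other cause that this check does not cover.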
I am using the Tesseract 4.0 .NET wrapper DLL and trying to use the tessdata_fast trained data, but the engine does not get initialized and throws an unhandled exception (read/write). I tried the engine modes "LstmOnly" and "Default" but could not initialize it. Please tell me whether I am using it the wrong way or whether I should not use this fast traineddata.
Error is as follows
I downloaded por.traineddata (for Portuguese) to the StreamingAssets/tessdata folder and I get "TessAPIInit failed. Output: -1". Should this file be enough? I notice there are a bunch of different files for English, such as "eng.cube.params", "eng.cube.nn" and many others. Should the same files exist for Portuguese?
As emailed to the mailing list, does it make sense to tag another release?
> Hi,
>
> With Tesseract now switching to regular (alpha) releases of 5.0.0; does
> it make sense to consider some versioning for language files as well?
>
> The Internet Archive has switched to using Tesseract for all our OCR,
> and I'm hoping that we can record exactly what version of language files
> was used for a specific OCR job. Currently, the answer is simple, since
> we're using the default packages from Ubuntu focal, but I am working on
> switching to Tesseract release/tag 5.0.0-20201231.
>
> But the tessdata_fast (or tessdata_best, for that matter) do not seem to
> have any recent 5.x releases:
> https://github.com/tesseract-ocr/tessdata_fast/releases
>
> Are there plans to create a release/tag for the tessdata_* repositories?
>
> Cheers,
> Merlijn
And the follow-up:
On 27/01/2021 12:42, Shree Devi Kumar wrote:
>> The Internet Archive has switched to using Tesseract for all our OCR,
>
> I am so happy to hear this. It will be great to have the Indic languages
> that were marked as non-ocrable so far be converted to text correctly on
> Internet Archive.
>
> Is there any page with instructions to do this? Can a language be specified
> while OCRing? eg. Better results are many times received using
> script/Devanagari instead of san for Sanskrit.
>
> Regarding your question about tessdata, there have only been minor changes
> to tessdata files but adding a tag is a good idea. I suggest you post this
> as a feature request in the repo.
Installations on case-insensitive filesystems (macOS, Windows, ...) cannot install both Lao.traineddata and lao.traineddata. It might also be confusing for users to know the difference between -l Lao and -l lao.
I suggest renaming the first one. Which name would be fine?
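The collision described above can be checked mechanically across a whole list of file names. The following is a rough sketch, assuming that lowercasing with Locale.ROOT is an adequate stand-in for how a case-insensitive filesystem compares names (real filesystems apply their own folding rules). The class name CaseCollisions and its find method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for spotting names that differ only by letter case.
class CaseCollisions {
    // Groups names by their lowercased form and keeps only groups with 2+ members.
    static Map<String, List<String>> find(Collection<String> names) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : names) {
            groups.computeIfAbsent(name.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
                  .add(name);
        }
        // Drop every group that has no collision.
        groups.values().removeIf(group -> group.size() < 2);
        return groups;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "Lao.traineddata", "lao.traineddata", "eng.traineddata");
        // Prints: {lao.traineddata=[Lao.traineddata, lao.traineddata]}
        System.out.println(find(names));
    }
}
```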
It would be useful to have an official 4.0.0 tag for the data as well.
What does the "vert" traineddata do?
Please see details at
tesseract-ocr/tessdata#88 (comment)
tesseract-ocr/tessdata_best#23
@jbreiden @AlexanderP - FYI - regarding problem with packaged traineddata for kur_ara.
The Chinese models chi_sim, chi_sim_vert, chi_tra and chi_tra_vert use chi instead of zho, which is the ISO 639-3 standard.
Since all other files are named according to this standard, I don't see a reason why this should not be the case for Chinese.
Fast (integer version) trained LSTM models
super@super-lap:/home/tesseract-lstm$ bin/tesseract --psm 0 ./tessinput.tif stdout
Page 1
Segmentation fault (core dumped)
How to actually use these tessdata files?
Please provide a hint in the README.
Thank you
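As a hedged illustration of one way to point the command-line tool at a directory of downloaded traineddata files, the sketch below builds a tesseract invocation using the --tessdata-dir and -l options. It assumes a tesseract executable is on the PATH. The class name TesseractCommand, the build method, and the example file and directory names are hypothetical placeholders. The program only prints the command; starting the process is left as a comment.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that assembles a tesseract command line.
class TesseractCommand {
    // Builds: tesseract --tessdata-dir <dir> -l <lang> <image> stdout
    static List<String> build(String image, String tessdataDir, String lang) {
        return Arrays.asList(
                "tesseract",
                "--tessdata-dir", tessdataDir, // directory holding <lang>.traineddata
                "-l", lang,                    // language (traineddata base name)
                image,
                "stdout");                     // write recognized text to standard output
    }

    public static void main(String[] args) {
        // Placeholder values; replace with real paths.
        List<String> cmd = build("page.png", "tessdata_fast", "eng");
        System.out.println(String.join(" ", cmd));
        // To actually run it (requires tesseract to be installed):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```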
I'm asking this to clarify, as the README was last modified before the 5.0.0 release.
According to the wiki, the equ and osd trained data reuse the 3.x data files. The weird thing is that osd is copied but equ is not. Is there a reason, e.g. is equ deprecated in 4.x?