tesseract-ocr / tessdata_fast

Fast integer versions of trained LSTM models

License: Apache License 2.0

ocr tesseract

tessdata_fast's Issues

Can Tesseract 4.0 be used offline?

Hi,
I see "network" in some of your descriptions. I'm writing an app that runs offline, with no network access. Can Tesseract 4.0 be used offline?

How to package tessconfigs?

I maintain the Fedora tesseract-tessdata package. I'm unsure how to deal with the tessconfigs folder when packaging. The release archive [1] contains the empty submodule folder and some broken symlinks that point to files inside the submodule. Is it safe to just ignore the folder and these files? Is it worth pulling the tessconfigs folder separately and packaging it?

[1] https://github.com/tesseract-ocr/tessdata_fast/archive/4.1.0/tessdata_fast-4.1.0.tar.gz
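
To see which symlinks in an unpacked archive are actually broken, a small check along the following lines may help. This is only a sketch: `BrokenSymlinkFinder` and `findBrokenSymlinks` are hypothetical names, not part of Tesseract or any packaging tool, and the directory to scan is whatever path the archive was extracted to.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Tesseract or any packaging tool.
final class BrokenSymlinkFinder {

    // Returns every symbolic link under root whose target does not exist.
    static List<Path> findBrokenSymlinks(Path root) throws IOException {
        // Files.walk does not follow symlinks by default, so links show up as links.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isSymbolicLink)
                    // Files.exists follows the link, so it is false for a dangling one.
                    .filter(p -> !Files.exists(p))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the first argument is the directory of the unpacked archive.
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        findBrokenSymlinks(root).forEach(p -> System.out.println("broken: " + p));
    }
}
```

The resulting list could then be used to decide which entries to drop from the package.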

Can we use the fast dataset with a Java program? Is it supported?

On Windows 7, I am trying to use the fast traineddata files in my Java project, but I get an "Invalid memory access" error even after setting the datapath.

I have also tried the best data files, but they give the same error. The default data files work, but they are huge, so I wanted to use the fast files.

Tesseract tess = new Tesseract();

tess.setDatapath("C:\\Users\\U6070534\\Downloads\\tess4j\\tessdata");
tess.setLanguage("eng");

String inputFilePath = "C:\\Users\\U6070534\\IdeaProjects\\ocrsample\\screenshot\\craft0.png";
try {
    textpath.add(tess.doOCR(new File(inputFilePath)));
} catch (TesseractException e1) {
    e1.printStackTrace();
}

Exception in thread "main" java.lang.Error: Invalid memory access
at com.sun.jna.Native.invokePointer(Native Method)
at com.sun.jna.Function.invokePointer(Function.java:470)
at com.sun.jna.Function.invoke(Function.java:404)
at com.sun.jna.Function.invoke(Function.java:315)
at com.sun.jna.Library$Handler.invoke(Library.java:212)
at com.sun.proxy.$Proxy0.TessBaseAPIGetUTF8Text(Unknown Source)
at net.sourceforge.tess4j.Tesseract.getOCRText(Tesseract.java:437)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:292)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:213)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:197)
at OcrReader.main(OcrReader.java:25)
Failed loading language 'eng'
Tesseract couldn't load any languages!
Process finished with exit code 1
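
The "Failed loading language 'eng'" line in the log above suggests the engine could not load eng.traineddata from the configured datapath. A pre-flight check like the following can rule out a missing or unreadable file before `doOCR` is called. It is a minimal sketch using only the JDK: `TessdataCheck` and `hasLanguage` are hypothetical names, not Tess4J API, and the check cannot detect a file that is present but unreadable by the engine version in use.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tess4J.
final class TessdataCheck {

    // True if <datapath>/<language>.traineddata is a regular, readable file.
    static boolean hasLanguage(Path datapath, String language) {
        Path model = datapath.resolve(language + ".traineddata");
        return Files.isRegularFile(model) && Files.isReadable(model);
    }

    public static void main(String[] args) {
        // Assumption: args[0] is the tessdata folder and args[1] the language code.
        Path datapath = Path.of(args.length > 0 ? args[0] : "tessdata");
        String language = args.length > 1 ? args[1] : "eng";
        if (!hasLanguage(datapath, language)) {
            System.err.println("Missing or unreadable: "
                    + datapath.resolve(language + ".traineddata"));
        }
    }
}
```

If the file is reported as present, the problem is more likely to lie elsewhere, for example in the file contents or in the native library.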

tessdata_fast trained data cannot be used with the Tesseract 4.0 .NET wrapper engine

I am using the Tesseract 4.0 .NET wrapper DLL and trying to use the tessdata_fast trained data, but the engine does not get initialized and throws an unhandled exception (read/write). I tried the engine modes "Lstmonly" and "default", but neither could initialize. Kindly tell me whether I am using it the wrong way or whether I should not use the fast traineddata at all.
The error is as follows:

Trained data doesn't seem to be working

I downloaded por.traineddata (for the Portuguese language) to the StreamingAssets/tessdata folder and I get a "TessAPIInit failed. Output: -1" error. Should this file be enough? I notice there are a bunch of different files starting with eng, like "eng.cube.params", "eng.cube.nn", and many others. Should there be the same for Portuguese?

Create a new tessdata_fast release/tag?

As emailed to the mailing list, does it make sense to tag another release?

> Hi,
>
> With Tesseract now switching to regular (alpha) releases of 5.0.0; does
> it make sense to consider some versioning for language files as well?
>
> The Internet Archive has switched to using Tesseract for all our OCR,
> and I'm hoping that we can record exactly what version of language files
> was used for a specific OCR job. Currently, the answer is simple, since
> we're using the default packages from Ubuntu focal, but I am working on
> switching to Tesseract release/tag 5.0.0-20201231.
>
> But the tessdata_fast (or tessdata_best, for that matter) do not seem to
> have any recent 5.x releases:
> https://github.com/tesseract-ocr/tessdata_fast/releases
>
> Are there plans to create a release/tag for the tessdata_* repositories?
>
> Cheers,
> Merlijn

And the follow-up:

On 27/01/2021 12:42, Shree Devi Kumar wrote:
>> The Internet Archive has switched to using Tesseract for all our OCR,
> 
> I am so happy to hear this. It will be great to have the Indic languages
> that were marked as non-ocrable so far be converted to text correctly on
> Internet Archive.
> 
> Is there any page with instructions to do this? Can a language be specified
> while OCRing? eg. Better results are many times received using
> script/Devanagari instead of san for Sanskrit.
> 
> Regarding your question about tessdata, there have only been minor changes
> to tessdata files but adding a tag is a good idea. I suggest you post this
> as a feature request in the repo.
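
Until a tag exists, one way to record exactly which language file was used for an OCR job is to store a checksum of the traineddata file alongside the job. The following is only a sketch of that idea, not something proposed in the thread: `ModelFingerprint` and `sha256Hex` are hypothetical names, and SHA-256 is an arbitrary choice of digest.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Tesseract.
final class ModelFingerprint {

    // Returns the lowercase hex SHA-256 digest of the file's contents.
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        // Stream the file in chunks so large models are not loaded into memory at once.
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: each argument is the path to a traineddata file.
        for (String arg : args) {
            System.out.println(sha256Hex(Path.of(arg)) + "  " + arg);
        }
    }
}
```

A checksum identifies the file contents but not their origin, so it complements a release tag rather than replacing one.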

Duplicate name problem with Lao / lao

Installations on case-insensitive filesystems (macOS, Windows, ...) cannot install both Lao.traineddata and lao.traineddata. Users may also find it confusing to tell -l Lao and -l lao apart.

I suggest renaming the first one. Which name would be fine?
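
The collision described above can be detected mechanically by grouping file names that differ only in letter case. The following sketch does that for a list of names; `CaseCollisions` and `find` are hypothetical names, and lowercasing with `Locale.ROOT` only approximates what a real case-insensitive filesystem does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Tesseract.
final class CaseCollisions {

    // Maps each lowercased name to the original names that share it,
    // keeping only groups with at least two members.
    static Map<String, List<String>> find(List<String> names) {
        Map<String, List<String>> byLower = new TreeMap<>();
        for (String name : names) {
            byLower.computeIfAbsent(name.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
                   .add(name);
        }
        // Drop names that do not collide with anything.
        byLower.values().removeIf(group -> group.size() < 2);
        return byLower;
    }

    public static void main(String[] args) {
        List<String> names = List.of("Lao.traineddata", "lao.traineddata", "eng.traineddata");
        find(names).forEach((key, group) -> System.out.println(key + " <- " + group));
    }
}
```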