naptha / tesseract.js
Pure Javascript OCR for more than 100 Languages 📖🎉🖥
Home Page: http://tesseract.projectnaptha.com/
License: Apache License 2.0
Hi, thanks for porting Tesseract to client-side OCR. Awesome!
I have one serious problem. I created a simple <input type="file" capture="camera" accept="image/*">
that holds an image, and tesseract.js works perfectly when running in a desktop browser.
As soon as I use it with the same picture on Mobile Safari (iOS 9), the result is a horribly garbled string.
I tried jQuery 1.x and 2.x; both return the same garbled result on the iPhone. Again, the same picture and the same codebase work on desktop browsers.
Please take a look at my pull request #70
Lack of proper documentation. Unable to use this. Please add documentation or a proper example so that it can be used.
Hi guys, I really like the idea of this library, how powerful it is and how easy it is to use. But I was taking a look at the codebase and realised that there is currently no test suite.
So I think it would be great to set up an initial test suite. We can discuss the tools to use; I'm happy with any runner or assertion library, but since this has to work both in Node.js and in the browser, maybe Karma would be a good fit? A lot of developers are used to working with it, and it gives you some cool things out of the box :).
I also think it's important for other developers who want to contribute to the project, because right now they can't tell whether their changes break any of the current behaviour.
Later we can easily integrate the test suite with Travis and get fast feedback on any change 🚀.
What do you think? Thanks!
Hi,
I can't open the 'langPath' URL; the location does not work from China.
Maybe another CDN path is needed.
Tesseract.js crashes in Safari on the iPhone 6. We suspect this is a memory issue and are looking into ways to work around it.
I'm building a mobile app for motivational quotes, where anyone can add quotes and send a link to an image that contains a quote. I used the following image.
It gives me the text with proper line breaks; happy to see this.
"YOU LEARNED
TO LAUGH
BEFORE
YOU LEARNED
TO TALK."
It gives me the following text.
w?-
3 <5"
I! r
I just want to ask whether the lib only works with images that have no background.
The module is only usable with Node v6.8.0+, which supports const and let variable declarations. A note about this should be included in the README. I'll begin working on making it compatible with other Node versions using Babel.
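Until a README note or a Babel build exists, a start-up guard along these lines could turn the syntax error into a readable message. This is only a sketch: `isNodeAtLeast` is a hypothetical helper, not part of tesseract.js, and the 6.8.0 minimum is taken from the report above. It is written in ES5 so that it still parses on the older Node versions it is meant to detect.

```javascript
// Hypothetical helper (not part of tesseract.js): compare dotted version strings.
function isNodeAtLeast(current, minimum) {
  var a = String(current).replace(/^v/, '').split('.').map(Number);
  var b = String(minimum).replace(/^v/, '').split('.').map(Number);
  for (var i = 0; i < 3; i++) {
    var x = a[i] || 0;
    var y = b[i] || 0;
    if (x !== y) return x > y; // first differing component decides
  }
  return true; // versions are equal
}

// Run this check before requiring the library, because the SyntaxError
// is thrown at require time.
if (!isNodeAtLeast(process.version, '6.8.0')) {
  console.error('tesseract.js needs Node >= 6.8.0, found ' + process.version);
}
```

The guard has to live in a file that itself avoids const, let and class, otherwise it fails in the same way as the library.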
Modify the directory permission: sudo chmod 755 /media/mtp;
Hi,
Does "Nothing to see here" in your project description mean that your project is not ready yet? If not, will this be the correct repo to watch for a version of Tesseract that will work in the browser (via Emscripten)?
Dear,
Please add Persian to the supported languages.
Thanks in advance
When trying to keep everything local, especially when no internet access is available, things do not work (obviously) because the traineddata language packs are hosted externally.
If we pull the tessdata, where do we put it so that it is loaded without an outgoing request? There is a mention of an environment variable (tessdata_prefix), but that appears to be a leftover from the original C code.
Where can we put the tessdata locally and specify its path in the worker.js file?
Edit: Well, I've found a way via web worker messaging: attaching a new parameter (rootURL) to the base tesseractinit object, then modifying the XHR request URL to use rootURL as the base of the path. It works really well and makes this flexible.
Not sure whether it's something you'd want me to fork and send back as a pull request, though.
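The rootURL approach described in the edit amounts to building each language-pack URL from a configurable base instead of a hard-coded CDN. A minimal sketch of that idea follows; `langDataUrl` and the `<lang>.traineddata` file-name pattern are assumptions for illustration, not the library's actual code.

```javascript
// Hypothetical helper: join a configurable root URL and a language code.
// The "<lang>.traineddata" naming is an assumption; match it to the files you host.
function langDataUrl(rootURL, lang) {
  const base = rootURL.replace(/\/+$/, ''); // drop trailing slashes
  return base + '/' + encodeURIComponent(lang) + '.traineddata';
}

console.log(langDataUrl('/assets/tessdata/', 'eng'));
```

Keeping the base path as a parameter means the same worker code can serve an offline deployment and a CDN-backed one.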
Using the sample code provided in the readme, I'm not seeing output of any kind. The process just seems to hang.
I created a sample repo. To reproduce:
git clone https://github.com/zeke/ocr-test/
cd ocr-test
npm install
npm start
My setup:
❯ node -v
v6.8.1
❯ uname -a
Darwin C02R41WSFVH8 15.6.0 Darwin Kernel Version 15.6.0: Thu Jun 23 18:25:34 PDT 2016; root:xnu-3248.60.10~1/RELEASE_X86_64 x86_64
❯ npm ls tesseract.js
ocr-test@ /Users/zeke/Desktop/ocr-test
└── [email protected]
PS Out of curiosity, why is node>=6.8 required?
This library seems promising and very capable. The demo page is great but the documentation is severely lacking - so lacking that I've wondered if it even works.
It's a shame that such a good library risks not getting used because people won't understand it without reading through the code.
Would it be useful to add support for accepting an image URL?
Hey guys! I wonder if you have considered supporting frameworks like React Native through Node. I was working on a Tesseract wrapper for React Native, but your lib looks much better. (My wrapper is currently only implemented on Android.)
So I tried to create a test using yours, but I'm getting this error:
I'm excited about the project. Does it support the Uyghur language?
GET http://localhost:1234/master/worker/worker.js net::ERR_CONNECTION_REFUSED
My code:
const path = require('path');
const Tesseract = require('tesseract.js');
const myImage = path.resolve(__dirname, 'out.png');
Tesseract.recognize(myImage)
.progress(message => console.log(message))
.catch(err => console.error(err))
.then(result => console.log(result))
.finally(resultOrError => console.log(resultOrError))
Issues:
/Users/xuyan/Documents/my/node/login/node_modules/tesseract.js/src/index.js:15
class TesseractWorker {
^^^^^
SyntaxError: Block-scoped declarations (let, const, function, class) not yet supported outside strict mode
at exports.runInThisContext (vm.js:53:16)
at Module._compile (module.js:404:25)
at Object.Module._extensions..js (module.js:432:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:313:12)
at Module.require (module.js:366:17)
at require (module.js:385:17)
at Object.<anonymous> (/Users/xuyan/Documents/my/node/login/testPng.js:3:19)
at Module._compile (module.js:425:26)
at Object.Module._extensions..js (module.js:432:10)
I seem to be having performance issues when using tesseract.js on Node with a JPEG. When I run the basic.js example using npm-installed modules instead of the local modules, I get pretty good results: "Benchmark took 3034.437842 miliseconds"
However, as soon as I run the same example with a 912 × 2121 px JPEG, the results are very poor:
"Benchmark took 41944.719605 miliseconds"
If I run the same JPEG in the browser example, performance is roughly the same as with the .png on Node. I think it has something to do with the way loadImage is implemented in the Node.js version, but I haven't been able to pin it down.
Edit: The delay occurs after the image itself has loaded and is probably related to interprocess communication. The decoded raw array being sent is huge. This doesn't cause a delay in the browser, however.
It's a really cool idea for Tesseract to work completely in the browser.
Hi, why don't you create a bower.json file so this can be used with Bower? Normally npm packages are for the backend and Bower is for the frontend. Is it a problem if I send a PR?
It would be awesome if you removed the circular references in the result object. They make it impossible to easily convert the result to JSON.
http://codepen.io/anon/pen/EgAObE?editors=1111
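Until the result object changes, one workaround is to drop back-references while serializing. The sketch below uses a `JSON.stringify` replacer that tracks the current chain of ancestors and omits any property pointing back to one of them. `toJSONSafe` is a hypothetical helper, and the `page` / `words` field names in the example are made up rather than taken from the real result object.

```javascript
// Hypothetical helper: serialize an object graph, omitting properties that
// point back to one of their own ancestors (true cycles only).
function toJSONSafe(value) {
  const ancestors = [];
  return JSON.stringify(value, function (key, val) {
    if (typeof val !== 'object' || val === null) return val;
    // `this` is the object that holds `key`; unwind the stack back to it.
    while (ancestors.length > 0 && ancestors[ancestors.length - 1] !== this) {
      ancestors.pop();
    }
    if (ancestors.includes(val)) return undefined; // cycle: omit the property
    ancestors.push(val);
    return val;
  });
}

// Made-up structure with a child that references its parent.
const page = { text: 'hi', words: [] };
page.words.push({ text: 'hi', page: page });
console.log(toJSONSafe(page));
```

Because only ancestors are tracked, an object that is merely referenced twice in different branches is still serialized in both places.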
tesseract.js:356 Uncaught DataCloneError: Failed to execute 'postMessage' on 'Worker': An object could not be cloned.
npm ERR! fetch failed https://registry.npmjs.org/tesseract.js/-/tesseract.js-1.0.9.tgz
So I'm using this code provided by @zeke. English text gets recognized fine, but this error appears whenever I use the Arabic recognizer.
const path = require('path')
const Tesseract = require('tesseract.js')
const file = require('fs').readFileSync(path.join(__dirname, 'arabic-3.png'))
Tesseract.recognize(file, {lang: 'ara'})
.progress(message => console.log(message))
.catch(err => console.error(err))
.then(result => console.log(result))
/Users/macbook/Development/projects/nodejs/ocr-test/node_modules/tesseract.js/node_modules/tesseract.js-core/index.js:4
function f(a){throw a;}var h=void 0,i=!0,j=null,k=!1;function aa(){return function(){}}function ba(a){return function(){return a}}var n,Module;Module||(Module=eval("(function() { try { return TesseractCore || {} } catch(e) { return {} } })()"));var ca={},da;for(da in Module)Module.hasOwnProperty(da)&&(ca[da]=Module[da]);var ea=i,fa=!ea&&i;
^
abort(5) at Error
at Error (native)
at Na (/Users/macbook/Development/projects/nodejs/ocr-test/node_modules/tesseract.js/node_modules/tesseract.js-core/index.js:32:26)
at ka (/Users/macbook/Development/projects/nodejs/ocr-test/node_modules/tesseract.js/node_modules/tesseract.js-core/index.js:507:108)
at Array.LHa (/Users/macbook/Development/projects/nodejs/ocr-test/node_modules/tesseract.js/node_modules/tesseract.js-core/index.js:402:25912)
at Qpa (/Users/macbook/Development/projects/nodejs/ocr-test/node_modules/tesseract.js/node_modules/tesseract.js-core/index.js:388:44877)
at kpa (/Users/macbook/Development/projects/nodejs/ocr-test/node_modules/tesseract.js/node_modules/tesseract.js-core/index.js:388:23029)
at jpa (/Users/macbook/Development/projects/nodejs/ocr-test/node_modules/tesseract.js/node_modules/tesseract.js-core/index.js:388:22303)
at lT (/Users/macbook/Development/projects/nodejs/ocr-test/node_modules/tesseract.js/node_modules/tesseract.js-core/index.js:387:80568)
at mT (/Users/macbook/Development/projects/nodejs/ocr-test/node_modules/tesseract.js/node_modules/tesseract.js-core/index.js:387:80700)
at Array.BS (/Users/macbook/Development/projects/nodejs/ocr-test/node_modules/tesseract.js/node_modules/tesseract.js-core/index.js:387:69011)
If this abort() is unexpected, build with -s ASSERTIONS=1 which can give more information.
Cannot enlarge memory arrays. Either (1) compile with -s TOTAL_MEMORY=X with X higher than the current value 100663296, (2) compile with ALLOW_MEMORY_GROWTH which adjusts the size at runtime but prevents some optimizations, or (3) set Module.TOTAL_MEMORY before the program runs.
Cannot enlarge memory arrays. Either (1) compile with -s TOTAL_MEMORY=X with X higher than the current value 100663296, (2) compile with ALLOW_MEMORY_GROWTH which adjusts the size at runtime but prevents some optimizations, or (3) set Module.TOTAL_MEMORY before the program runs.
/Test-Projects/tesseract.js/node_modules/tesseract.js-core/index.js:4
function f(a){throw a;}var h=void 0,i=!0,j=null,k=!1;function aa(){return function(){}}function ba(a){return function(){return a}}var n,Module;Module||(Module=eval("(function() { try { return TesseractCore || {} } catch(e) { return {} } })()"));var ca={},da;for(da in Module)Module.hasOwnProperty(da)&&(ca[da]=Module[da]);var ea=i,fa=!ea&&i;
abort("Cannot enlarge memory arrays. Either (1) compile with -s TOTAL_MEMORY=X with X higher than the current value 100663296, (2) compile with ALLOW_MEMORY_GROWTH which adjusts the size at runtime but prevents some optimizations, or (3) set Module.TOTAL_MEMORY before the program runs.") at Error
at Error (native)
at Na (/Test-Projects/tesseract.js/node_modules/tesseract.js-core/index.js:32:26)
at ka (/Test-Projects/tesseract.js/node_modules/tesseract.js-core/index.js:507:108)
at Function.pb (/Test-Projects/tesseract.js/node_modules/tesseract.js-core/index.js:12:26)
at vd (/Test-Projects/tesseract.js/node_modules/tesseract.js-core/index.js:331:190)
at UFa (/Test-Projects/tesseract.js/node_modules/tesseract.js-core/index.js:396:56010)
at WEa (/Test-Projects/tesseract.js/node_modules/tesseract.js-core/index.js:396:39452)
at Gra (/Test-Projects/tesseract.js/node_modules/tesseract.js-core/index.js:388:78184)
at Mpa (/Test-Projects/tesseract.js/node_modules/tesseract.js-core/index.js:388:42487)
at Rpa (/Test-Projects/tesseract.js/node_modules/tesseract.js-core/index.js:388:45819)
If this abort() is unexpected, build with -s ASSERTIONS=1 which can give more information.
I copied <script src='https://cdn.rawgit.com/naptha/tesseract.js/1.0.8/dist/tesseract.js'></script>
from the README, but it returned a 404 error. I couldn't access the file via wget or my browser.
What helps is changing the 1.0.8 part to 0.2.0, which should be the correct version of tesseract.js according to your releases. To fix this, I propose either adding the correct file so that the URL works or changing the README to the 0.2.0 URL.
Working URL: https://cdn.rawgit.com/naptha/tesseract.js/0.2.0/dist/tesseract.js
Hi @antimatter15,
I'm wondering if we could host this lib on CDNJS. It looks like some hard-coded URLs point to "https://cdn.rawgit.com/naptha/tesseract.js/". Do you have any idea how we should handle that?
Thanks.
I'm debugging my app and get an EADDRINUSE error when Tesseract.recognize is called. I tracked this down, and I think it's because you don't specify execArgv. See the following issue others have had with debugging and forking in general:
As it stands, I can't debug to inspect the result.
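By default `child_process.fork` passes the parent's `process.execArgv` to the child, so a child started under a debugger can try to bind the same debug port. If that is the cause here, one possible workaround is to fork with a filtered `execArgv`. The sketch below is an assumption about how that could look, not the library's code; `withoutDebugFlags`, `forkWorker` and the worker path argument are hypothetical.

```javascript
const { fork } = require('child_process');

// Hypothetical helper: remove debugger flags such as --debug, --debug-brk=5858,
// --inspect and --inspect-brk from an execArgv array.
function withoutDebugFlags(execArgv) {
  return execArgv.filter(arg => !/^--(debug|inspect)(-brk)?(=|$)/.test(arg));
}

// Hypothetical wrapper: fork a worker script without inheriting debugger flags.
function forkWorker(workerPath) {
  return fork(workerPath, [], { execArgv: withoutDebugFlags(process.execArgv) });
}

console.log(withoutDebugFlags(['--debug-brk=5858', '--harmony']));
```

With this in place, the parent process can still be debugged while the child runs without a debugger attached.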
node v6.8.1 supports:
Console {
log: [Function: bound ],
info: [Function: bound ],
warn: [Function: bound ],
error: [Function: bound ],
dir: [Function: bound ],
time: [Function: bound ],
timeEnd: [Function: bound ],
trace: [Function: bound trace],
assert: [Function: bound ],
Console: [Function: Console] }
Tesseract fails on line 123 when it attempts to call the debug method.
var Tesseract = require('tesseract.js');
var myImage = './screenshot_05.png';
Tesseract.detect(myImage)
.progress(function(message){
console.log('progress is: ', message)
});
~/code/tesseract_test $ node index.js
progress is: { status: 'loading tesseract core' }
progress is: { status: 'loaded tesseract core' }
progress is: { status: 'initializing tesseract', progress: 0 }
pre-main prep time: 67 ms
progress is: { status: 'initializing tesseract', progress: 1 }
progress is: { status: 'loading osd.traineddata', progress: 1 }
Number of blobs post-filtering = 91
Number of blobs to try = 91
/Volumes/HOME/Users/pj/code/tesseract_test/node_modules/tesseract.js/src/index.js:123
if(this._resolve.length === 0) console.debug(data);
^
TypeError: console.debug is not a function
at TesseractJob._handle (/Volumes/HOME/Users/pj/code/tesseract_test/node_modules/tesseract.js/src/index.js:123:43)
at TesseractWorker._recv (/Volumes/HOME/Users/pj/code/tesseract_test/node_modules/tesseract.js/src/index.js:71:21)
at ChildProcess.<anonymous> (/Volumes/HOME/Users/pj/code/tesseract_test/node_modules/tesseract.js/src/node/index.js:14:18)
at emitTwo (events.js:106:13)
at ChildProcess.emit (events.js:191:7)
at process.nextTick (internal/child_process.js:744:12)
at _combinedTickCallback (internal/process/next_tick.js:67:7)
at process._tickCallback (internal/process/next_tick.js:98:9)
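Since the console object listed above has no `debug` method, a guarded call that falls back to `log` would avoid the TypeError. This is a minimal sketch; `debugLog` is a hypothetical helper, not the fix the maintainers applied.

```javascript
// Hypothetical helper: use `debug` when the console has it, otherwise fall back to `log`.
function debugLog(target, data) {
  const fn = typeof target.debug === 'function' ? target.debug : target.log;
  fn.call(target, data);
}

debugLog(console, { status: 'example' });
```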
<html>
<head>
</head>
<body>
<img src="dream.jpg" class="to_ocr">
<div class="prog"><div>
<canvas class="display"></canvas>
<script src="http://tenso.rs/tesseract.js"></script>
<script src="https://code.jquery.com/jquery-2.2.1.min.js"></script>
<script>
var img = $('body').find(".to_ocr");
Tesseract.recognize( img, { progress: 'prog', lang: 'eng'} )
.then( 'display' );
</script>
</body>
</html>
getting:
Uncaught (in promise) DOMException: Failed to execute 'postMessage' on 'Worker': An object could not be cloned.
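A likely cause is that `img` here is a jQuery object rather than an image element, and a jQuery object cannot be cloned for `postMessage`. A minimal sketch of unwrapping the element first follows. `toImageElement` is a hypothetical helper, and the commented usage assumes the `.progress()` / `.then()` job API with callback functions, as used in the Node examples on this page.

```javascript
// Hypothetical helper: return the underlying DOM element of a jQuery-style
// wrapper (detected via its `jquery` version property), or the input unchanged.
function toImageElement(input) {
  if (input && typeof input === 'object' && 'jquery' in input) {
    return input[0];
  }
  return input;
}

// Browser usage (assumes the job API used elsewhere on this page):
// Tesseract.recognize(toImageElement($('.to_ocr')), { lang: 'eng' })
//   .progress(message => console.log(message))
//   .then(result => console.log(result));
```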
Hi all,
I'm Peter from @cdnjs. I'm going to host this awesome JS tool on cdnjs.com, but I just found that the versioning is v1.0.x on npm but v0.x.0 on GitHub. Do you have any suggestion about which one I should use?
Thanks,
Peter
If I want to use my own trained data, how should I do it?
Hi.
first, good work!
second: is it possible to use tesseract.js directly from javascript without a web worker? Perhaps an earlier build output gives that prior to being web-worker-ified(?)
Thanks.
Is there a way to detect multiple languages?
Would be awesome to be able to init before invoking .recognize
The recognize function works incorrectly for Polish.
My test image:
Tesseract output (bad words were marked; only the first few lines are shown):
Śród takrcn pól pned raty nad brzegrem ruczaru
Na pagórku mewrerkrm, we brzozowym garu
Sial dwór szlachecki, z dnewa, recz podmurawany
Śwrecrny się z daleka panrelane ściany
Tym brelsze ze odbrte od cremner zrelem
Tupoh, co gu bronią od wiatrów Jesrem
Dom mreszkalny ruewrelkrlecz zewsząd cnęaogrI stodołę mar wrerka
r plzy me] trzy stogr\n\n
The text Tesseract recognizes differs significantly from the text in the image. Letters are replaced with other letters, which turns words into unintelligible strings.
By adding Beerpay's badge, people will know the easiest way to support this great project.
thanks for this great work guys!
cheers!
Can you please add instructions on how you used emscripten to compile the project from tesseract-ocr/tesseract? I am working on a project like this and this can be very informative. Thanks.