phil-el / phetools

Home Page: http://tools.wmflabs.org/phetools/

License: GNU General Public License v3.0

Python 99.16% CSS 0.56% PHP 0.10% HTML 0.13% Shell 0.05%

phetools's People

michal-josef-spacek hinote-github weftwiki jayantanth sipun aubreymcfato adithyak04 pikse vladiscripts

phetools's Issues

Support additional languages

According to https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages, there are over 100 language packs available for Tesseract, but phetools only supports 24 of them. Is there any reason that only those 24 are supported? For example, there is a Kannada Wikisource, and a Kannada language pack available for Tesseract, but phetools doesn't support Kannada, so they are not able to use phetools for doing Kannada OCR. Is this just because phetools is out of date or is there some other reason that most of the Tesseract languages aren't supported?
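To make the request concrete, here is a minimal sketch of how a lookup from Wikisource language codes to Tesseract language codes could be extended. The table and both function names are hypothetical; phetools' actual language table is not shown in this issue. The four codes listed (eng, fra, rus, kan) are standard Tesseract language pack names.

```python
# Hypothetical mapping from Wikisource language subdomains to Tesseract
# language pack codes; phetools' real table is not shown in this issue.
WIKISOURCE_TO_TESSERACT = {
    "en": "eng",
    "fr": "fra",
    "ru": "rus",
    "kn": "kan",  # Kannada, the language requested above
}


def tesseract_lang(wikisource_lang):
    """Return the Tesseract code for a Wikisource language, or None."""
    return WIKISOURCE_TO_TESSERACT.get(wikisource_lang)


def unsupported(wikisource_langs):
    """List the requested languages that have no mapping yet."""
    return [lang for lang in wikisource_langs if tesseract_lang(lang) is None]
```

Under this assumption, supporting a new Wikisource would mostly mean adding one entry to the table and installing the matching language pack.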

Copying license

Please add the COPYING file (which is mentioned in the comments in the sources) or some other kind of license.

Redundant </div> added to page footer

E.g. here:

This </div> seems to be redundant and it shows up on Special:LintErrors as stripped-tag.

Syntax error in OCR

Hi!

I just want to report T228594: a syntax error that prevents the OCR tools from being used on Wikisource. Matmarex has already investigated this bug.

Please find a full console report on T232957.

Make sure that pdf_to_djvu does not convert the same file twice

This would allow clients to send an extra cmd=convert request in order to make sure that the conversion process has started.
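One way to get this behaviour is to record every identifier for which a conversion has been requested and to treat repeats as no-ops. The sketch below assumes an in-memory registry; the class and method names are hypothetical and are not taken from pdf_to_djvu. A real service would probably need to persist the state across processes.

```python
import threading


class ConversionRegistry:
    """Tracks conversions so that repeated requests for one file are no-ops."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}  # identifier -> "pending" or "done"

    def request(self, identifier, start_job):
        """Start a conversion only if none is known for this identifier.

        Returns True if a new job was started, False if it was a duplicate.
        """
        with self._lock:
            if identifier in self._state:
                return False
            self._state[identifier] = "pending"
        start_job(identifier)
        return True

    def mark_done(self, identifier):
        """Record that the conversion for this identifier has finished."""
        with self._lock:
            self._state[identifier] = "done"

    def status(self, identifier):
        """Return "pending", "done", or None for an unknown identifier."""
        with self._lock:
            return self._state.get(identifier)
```

With something like this in place, a client could safely send cmd=convert as often as it likes: only the first request starts work.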

Sortable table and SVG charts

From here: https://fr.wikisource.org/wiki/Discussion_utilisateur:Phe/2015#Deux_cadeaux_pour_tous_les_utilisateurs

Is it possible to make the general table sortable by column ("wikitable sortable")? It would help with quick comparisons and charts.
Graphs like this are easy to read for en.ws or fr.ws readers, but they will certainly disappoint the many hy.ws, ca.ws, etc. readers, because many lines are crowded together at the bottom of the chart. Is it possible to render those images as vector graphics, with thin lines that can be explored by magnifying the image (something like a Prezi image)?
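For the sortable table, the change may be as small as emitting the MediaWiki classes "wikitable sortable" on the generated table. The following is a minimal sketch, assuming the statistics are available as a list of rows. The function name is hypothetical and any data passed in here is illustrative, not real statistics.

```python
def sortable_wikitable(headers, rows):
    """Render rows as MediaWiki table markup with the sortable class."""
    lines = ['{| class="wikitable sortable"']
    # Header cells on one line are separated by "!!".
    lines.append("! " + " !! ".join(headers))
    for row in rows:
        lines.append("|-")
        # Data cells on one line are separated by "||".
        lines.append("| " + " || ".join(str(cell) for cell in row))
    lines.append("|}")
    return "\n".join(lines)
```

MediaWiki adds the column sort controls itself once the class is present, so no extra JavaScript should be needed on the tool side.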

fast_hocr seems to be completely broken

I'm trying to use the hocr service on the Russian Wikisource (calling it with &lang='ru' from my custom JS for now). For pages of some djvu files it returns "unable to locate file /data/project/phetools/cache/hocr/<...>.hocr for page <...>.djvu lang ru" to the callback.

The job state in the hocr queue is 'success' after being 'pending' for a long time. I tried hocr on some pages of 3 Russian djvu files (the indexes and pages are on ru WS, while the djvu files are on Commons): 1 worked and 2 failed.

More specifically, it fails with "unable to locate ... .hocr ..." for pages in [[s:ru:Индекс:Крестовский - Петербургские золотопромышленники.djvu|this index]] (jobid=9966 in the hocr queue). There it runs fast_hocr() and does not fall back to slow_hocr(); I checked locally that fast_hocr() runs with no problem and has_word_bbox() returns True. It works for pages in [[s:ru:Индекс:Собрание сочинений Марка Твэна (1898) т.8.djvu|this index]], where it falls back to slow_hocr() and replaces the text.

I created an offline version of the tool and debugged its logic locally. fast_hocr() ''seems to be completely broken'': djvu_text_to_hocr.do_parse() produces files named page_NNNN.html (and then page_NNNN.html.bz2 after compression) in the cache_path directory, while get_hocr() tries to fetch files with names like page_NNNN.hocr and page_NNNN.hocr.bz2 from cache_path. So hocr seems to work when it falls back to slow_hocr(), which replaces the text of the existing OCR layer with the text produced by tesseract. However, if fast_hocr() has already run and placed its output files in cache_path, it does not fall back to slow_hocr(): hocr() returns nothing and get_hocr() simply fails.
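The mismatch described above could be worked around on the reading side by accepting both extensions. This is only a sketch: find_cached_hocr is a hypothetical helper, not the real get_hocr(), and the four-digit zero padding of NNNN is an assumption.

```python
import bz2
import os


def find_cached_hocr(cache_path, page_number):
    """Return the cached hOCR text for a page, or None if nothing is cached.

    Tries the name get_hocr() is reported to look for first (.hocr), then
    the name do_parse() is reported to write (.html), each as a bz2 file
    and then as a plain file.
    """
    # Assumption: NNNN is the page number zero-padded to four digits.
    base = os.path.join(cache_path, "page_%04d" % page_number)
    for ext in (".hocr", ".html"):
        for suffix, opener in ((".bz2", bz2.open), ("", open)):
            path = base + ext + suffix
            if os.path.exists(path):
                with opener(path, "rt", encoding="utf-8") as handle:
                    return handle.read()
    return None
```

The cleaner fix is probably to make the writer and the reader agree on a single extension; the fallback above only shows that the cached data is already there under the other name.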

We need working fast_hocr() functionality so that ru WS users can simply split pages that have an existing OCR layer into paragraphs. Thanks in advance.

[[s:ru:User:Hinote|Hinote (A)]]

pdf_to_djvu_cgi.py does not find IA scan files

When IA-Upload calls pdf_to_djvu_cgi.py for the file https://archive.org/details/ComeRuinareLAutoritaImage, this error is returned:

400 Bad Request {"text": "invalid ia identifier, I can't locate needed files", "error": 2}
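To help narrow this down, here is a minimal sketch of a lookup that would also produce a more informative error. It assumes that the tool needs a PDF from the item and that the item's file list is available as a list of dicts with a "name" key, as in the Internet Archive metadata response. Both function names are hypothetical and any file names passed in are illustrative.

```python
def pick_pdf(files):
    """Return the name of the first .pdf entry in an IA-style file list."""
    for entry in files:
        name = entry.get("name", "")
        if name.lower().endswith(".pdf"):
            return name
    return None


def describe_missing(identifier, files):
    """Build an error text that lists what the item actually contains."""
    names = sorted(entry.get("name", "") for entry in files)
    return "no PDF found for %r; item contains: %s" % (
        identifier,
        ", ".join(names),
    )
```

Listing the file names that were found would make it much easier to see whether the item really lacks the needed files or whether the matching rule is too strict.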

Improve error message

Please see here for details: https://phabricator.wikimedia.org/T223012

Add fast_hocr_only mode

Actually, we need a mode for fast_hocr in which it does not fall back to slow_hocr even if it cannot produce data from the existing OCR layer.

The tesseract software used at toollabs performs badly on Russian texts. That is why the OCR gadget was not enabled at ru.ws several years ago and is still not enabled. Tesseract was bad several years ago and still produces a text layer of unacceptable quality.

At the same time, we would like to use the fast_hocr functionality: it works brilliantly on existing text layers, splitting the existing text into paragraphs.

So, as an implementation idea, you could add '&mode=fast_hocr_only' (or something like that) to the URL fields recognized by the hocr web front-end and, in that case, call only fast_hocr() with no attempt to call slow_hocr().
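The proposed dispatch could look roughly like the sketch below. run_hocr is a hypothetical wrapper; fast_hocr and slow_hocr are passed in as callables so the example is self-contained, and it assumes that fast_hocr returns None when the existing text layer yields nothing.

```python
def run_hocr(page, mode, fast_hocr, slow_hocr):
    """Dispatch between the two hOCR paths.

    With mode == "fast_hocr_only" the tesseract-based slow path is never
    called, even when the fast path produces no result.
    """
    result = fast_hocr(page)
    if result is not None:
        return result
    if mode == "fast_hocr_only":
        # The caller can report "no usable text layer" instead of running OCR.
        return None
    return slow_hocr(page)
```

Any other value of mode keeps today's behaviour, so existing callers would not be affected.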

Thanks in advance.
