phetools's People

Contributors

michal-josef-spacek, phil-el, weftwiki

phetools's Issues

Support additional languages

According to https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages, there are over 100 language packs available for Tesseract, but phetools supports only 24 of them. Is there a reason that only those 24 are supported? For example, there is a Kannada Wikisource and a Kannada language pack for Tesseract, but phetools doesn't support Kannada, so Kannada Wikisource contributors cannot use phetools for OCR. Is this just because phetools is out of date, or is there some other reason that most of the Tesseract languages aren't supported?

Copying license

Please add the COPYING file (which is mentioned in comments in the sources) or some other kind of license.

Sortable table and SVG charts

From here: https://fr.wikisource.org/wiki/Discussion_utilisateur:Phe/2015#Deux_cadeaux_pour_tous_les_utilisateurs

  • Is it possible to make the general table "Wikitable-sortable" by column? It would help with quick comparisons and charts.
  • Graphs like this are easy to read for en.ws or fr.ws readers, but would certainly disappoint the many hy.ws, ca.ws etc. readers, because many lines are cluttered together at the bottom of the chart. Is it possible to render those images as vector graphics, with thin lines that can be explored by magnifying the image (something like a prezi image)?

fast_hocr seems to be completely broken

I'm trying to use the hocr service in the Russian Wikisource (for now calling it with &lang='ru' from my custom js), and for pages of some djvu files it returns "unable to locate file /data/project/phetools/cache/hocr/<...>.hocr for page <...>.djvu lang ru" to the callback.

The job state in the hocr queue is 'success' after being 'pending' for a long time. I tried hocr on some pages of 3 Russian djvu files (the indexes and pages are in ru WS, while the djvu files are on Commons): it works for 1 and fails for 2.

More specifically, it fails with "unable to locate ... .hocr ..." for pages in [[s:ru:Индекс:Крестовский - Петербургские золотопромышленники.djvu|this index]] (jobid=9966 in the hocr queue), where it runs fast_hocr() and does not fall back to slow_hocr(). I checked locally: fast_hocr() runs with no problem and has_word_bbox() returns True. It works for pages in [[s:ru:Индекс:Собрание сочинений Марка Твэна (1898) т.8.djvu|this index]], where it falls back to slow_hocr() and replaces the text.

I created an offline version of the tool and debugged its logic locally, and fast_hocr() ''seems to be completely broken'': djvu_text_to_hocr.do_parse() writes files named page_NNNN.html (and then page_NNNN.html.bz2 after compression) into the cache_path directory, while get_hocr() tries to fetch files named page_NNNN.hocr and page_NNNN.hocr.bz2 from cache_path. So hocr seems to work only when it falls back to slow_hocr(), which replaces the text of the existing OCR layer with text produced by tesseract. However, if fast_hocr() has already run and placed its output files into cache_path, it does not fall back to slow_hocr(): hocr() returns nothing and get_hocr() simply fails.
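
To make the mismatch concrete, here is a minimal Python sketch of a cache lookup that accepts either naming scheme. It is not code from phetools: find_cached_page and CANDIDATE_SUFFIXES are hypothetical names, and the four-digit zero padding is an assumption read off the page_NNNN pattern described above.

```python
import os
import tempfile

# Hypothetical helper, not part of phetools. It accepts both the names
# get_hocr() is reported to look for (.hocr) and the names do_parse() is
# reported to write (.html), with or without bz2 compression.
CANDIDATE_SUFFIXES = (".hocr", ".hocr.bz2", ".html", ".html.bz2")


def find_cached_page(cache_path, page_nr):
    """Return the first existing cache file for page_nr, or None."""
    # Assumption: NNNN is the page number zero-padded to four digits.
    base = os.path.join(cache_path, "page_%04d" % page_nr)
    for suffix in CANDIDATE_SUFFIXES:
        candidate = base + suffix
        if os.path.exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Simulate a cache that only holds the fast_hocr()-style output.
    cache = tempfile.mkdtemp()
    open(os.path.join(cache, "page_0001.html.bz2"), "wb").close()
    print(find_cached_page(cache, 1))
    print(find_cached_page(cache, 2))
```

A lookup restricted to the two .hocr suffixes would return None for the simulated cache, which matches the failure described above. Renaming the files that do_parse() writes would be an equally plausible fix.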

We need a working fast_hocr() so that ru WS users can simply split pages that have an existing OCR layer into paragraphs. Thanks in advance.

[[s:ru:User:Hinote|Hinote (A)]]

Add fast_hocr_only mode

Actually, we need a mode for fast_hocr in which it does not fall back to slow_hocr even if it cannot produce data from the existing OCR layer.

The tesseract software used at toollabs performs badly on Russian texts. That was the reason why the OCR gadget was not enabled at ru.ws several years ago, and it is still not enabled. Tesseract was bad several years ago and still produces a text layer of unacceptable quality.

At the same time, we would like to use the fast_hocr functionality: it works brilliantly on existing text layers, splitting the existing text into paragraphs.

So, as an implementation idea, you could add '&mode=fast_hocr_only' or something like that to the URL parameters recognized by the hocr web front-end, and in that case call only fast_hocr() with no attempt to call slow_hocr().
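
A rough Python sketch of how such a mode could be dispatched, under stated assumptions: parse_mode and run_hocr are hypothetical names, the "default" mode value is invented for illustration, and fast_hocr and slow_hocr are passed in as plain callables rather than taken from phetools. The sketch also assumes fast_hocr signals "no usable OCR layer" by returning None.

```python
from urllib.parse import parse_qs


def parse_mode(query_string):
    """Read the hypothetical 'mode' URL parameter; 'default' if absent."""
    params = parse_qs(query_string)
    return params.get("mode", ["default"])[0]


def run_hocr(page, mode, fast_hocr, slow_hocr):
    """Try fast_hocr first; skip the slow fallback in fast_hocr_only mode.

    Assumption: fast_hocr returns None when it cannot produce data from
    the existing OCR layer.
    """
    result = fast_hocr(page)
    if result is not None:
        return result
    if mode == "fast_hocr_only":
        # Report failure instead of running tesseract.
        return None
    return slow_hocr(page)


if __name__ == "__main__":
    mode = parse_mode("lang=ru&mode=fast_hocr_only")
    # Stand-in callables: fast_hocr fails, slow_hocr would run tesseract.
    print(run_hocr("page.djvu", mode, lambda p: None, lambda p: "tesseract text"))
```

With this shape, a request without the mode parameter keeps the current behaviour, so existing callers would not be affected.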

Thanks in advance.
