files_fulltextsearch_tesseract's People

Contributors

andyscherzinger, artificialowl, nickvergessen, skjnldsv

files_fulltextsearch_tesseract's Issues

Can't OCR the PDF with image inside it

Hi,
I already updated fulltextsearch tesseract to version 1.2.2, and it works well on image files, but it doesn't work on PDF files with images inside them. Is that a bug, or has that feature not been developed yet?
Thank you

High CPU usage on multicore

I'm using Ubuntu 22.04.3, NC27.0.2 and files_fulltextsearch_tesseract 27.0.0.
My machine has 4 cores (4 threads) available, but Tesseract OCR takes ages, and CPU usage sometimes sits at 100% for hours. Checking with htop, I noticed that tesseract runs with multiple (4) threads but never seems to finish ;).
I'm not sure how long this issue has existed. Looking at #14, it seems tesseract only used one thread in the past. I have been affected by this high CPU usage for at least a couple of months.

After a short search, I stumbled over the possibility to use a thread limit (https://github.com/thiagoalessio/tesseract-ocr-for-php#thread-limit). It seems there are cases (like mine) in which tesseract is blocking itself with too many cores available (see also tesseract-ocr/tesseract#898).

Thus, I did some testing with this thread limit...
OCR Settings:
Limit PDF pages: 20
Timeout: 60 seconds

Testfile: Nextcloud Manual.pdf

I measured the runtime for this loop:

for ($i = 1; $i <= $pages; $i++) {

Runtime for all pages:
threadLimit(1): 27.84 seconds
threadLimit(2): 26.21 seconds
threadLimit(3): 20.00 seconds
threadLimit(4): more than 15 minutes (was still running while writing this issue), EDIT: It took 1119.51 seconds (18.66 minutes) to finish.

As you can see, limiting the number of threads does improve the situation a lot. Double checking on a test instance with 2 cores available shows the same results (tesseract is blocking itself when running on all available cores).
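
The timing comparison above can be reproduced outside the app. The sketch below is not the app's code. It assumes a Tesseract build that uses OpenMP, whose runtime honors the standard OMP_THREAD_LIMIT environment variable; the helper names env_with_thread_limit and time_command are made up for illustration.

```python
import os
import subprocess
import time


def env_with_thread_limit(limit, base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set to `limit`."""
    if limit < 1:
        raise ValueError("thread limit must be >= 1")
    # Copy so that neither os.environ nor the caller's dict is modified.
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(limit)
    return env


def time_command(argv, limit):
    """Run `argv` once under the given thread limit; return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(
        argv,
        env=env_with_thread_limit(limit),
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start
```

For example, time_command(["tesseract", "page.png", "stdout"], 1) would time a single page with one thread, assuming page.png exists and tesseract is on the PATH.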

Compatibility for NC 26

Nextcloud 26 is out and the fulltextsearch apps are not compatible yet.

Does it work with Nextcloud 26?

Thanks and regards

How to deal with mysql connection timeouts for long OCR jobs?

A pdf with a few hundred pages broke my indexing, as after all pages were OCR'ed, the run (via fulltextsearch:document:index or via full-index) crashed with:

  [PDOException (HY000)]
  SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
 

Exception trace:
  at /var/www/html/3rdparty/doctrine/dbal/src/Driver/PDO/Statement.php:92
 PDOStatement->execute() at /var/www/html/3rdparty/doctrine/dbal/src/Driver/PDO/Statement.php:92
 Doctrine\DBAL\Driver\PDO\Statement->execute() at /var/www/html/3rdparty/doctrine/dbal/src/Connection.php:1059
 Doctrine\DBAL\Connection->executeQuery() at /var/www/html/lib/private/DB/Connection.php:261
 OC\DB\Connection->executeQuery() at /var/www/html/3rdparty/doctrine/dbal/src/Query/QueryBuilder.php:345
 Doctrine\DBAL\Query\QueryBuilder->execute() at /var/www/html/lib/private/DB/QueryBuilder/QueryBuilder.php:281
 OC\DB\QueryBuilder\QueryBuilder->execute() at /var/www/html/lib/private/Comments/Manager.php:419
 OC\Comments\Manager->getForObject() at /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php:820
 OCA\Files_FullTextSearch\Service\FilesService->updateCommentsFromFile() at /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php:812
 OCA\Files_FullTextSearch\Service\FilesService->updateContentFromFile() at /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php:741
 OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocumentFromFile() at /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php:657
 OCA\Files_FullTextSearch\Service\FilesService->generateDocumentFromIndex() at /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php:705
 OCA\Files_FullTextSearch\Service\FilesService->updateDocument() at /var/www/html/apps/files_fulltextsearch/lib/Provider/FilesProvider.php:314
 OCA\Files_FullTextSearch\Provider\FilesProvider->updateDocument() at /var/www/html/apps/fulltextsearch/lib/Command/DocumentIndex.php:112

So updateDocument seems to run into MySQL connection timeouts during the main loop over the PDF pages. With the page limit set to 20 the OCR completes, but at 100 it times out. I presume the connection timeout is around 5 or 10 minutes.

So how do I deal with this problem?

  • I figure I could increase the mysql connection timeout in the Nextcloud settings, but I'd rather not, as this would impact a whole lot more apps/core possibly negatively, especially since the connection timeout ocr needs would be around 2 hours for 1000 pdf pages...
  • Ideally the "main loop" could ping the database connection in TesseractService.php:#L278, but as I don't see a database connection anywhere here, I presume this is handled in the general occ code. So is this even touchable in the app?
  • I don't want to limit my whole FTS to < 20 pdf pages, which also depends on the --psm and on the general load of the server and will lead to random errors when indexing. I have a few hundred users dealing with policymaking involving big pdfs so ideally, it would not be necessary to limit pdf pages at all...
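
The "ping the connection from the main loop" idea in the second bullet can be sketched generically. This is only an illustration of the pattern, not Nextcloud or app code: KeepAlive, ocr_pages and the ping callback are hypothetical names, and how a real ping would reach the database connection is exactly the open question raised above.

```python
import time


class KeepAlive:
    """Call `ping` at most once per `interval` seconds during long-running work."""

    def __init__(self, ping, interval=60.0, clock=time.monotonic):
        self._ping = ping
        self._interval = interval
        self._clock = clock
        self._last = clock()

    def tick(self):
        """Ping if at least `interval` seconds passed since the last ping."""
        now = self._clock()
        if now - self._last >= self._interval:
            self._ping()
            self._last = now
            return True
        return False


def ocr_pages(pages, ocr_page, keepalive):
    """OCR each page in turn, giving the keepalive a chance after every page."""
    results = []
    for page in pages:
        results.append(ocr_page(page))
        keepalive.tick()
    return results
```

The interval would need to stay well below the server's idle timeout. Pinging once per page instead of on a timer would also work, at the cost of more round trips.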

Cron spams errors if an encrypted PDF is indexed

This is a sample pdf file: sample_encrypted.pdf

It was obtained with:

$ wget https://classics.berkeley.edu/sites/default/files/2020-01/sample.pdf
$ pdftk sample.pdf output sample_encrypted.pdf owner_pw xyz user_pw abc

If you place this in an indexed directory, the cron service will spew out quite unhelpful mails to your admin every fifteen minutes:

**** This file requires a password for access.
Error: /invalidfileaccess in pdf_process_Encrypt
Operand stack:

Execution stack:
%interp_exit .runexec2 --nostringval-- runpdf --nostringval-- 2 %stopped_push --nostringval-- runpdf runpdf false 1 %stopped_push 1992 1 3 %oparray_pop 1991 1 3 %oparray_pop 1979 1 3 %oparray_pop 1980 1 3 %oparray_pop runpdf runpdf runpdf runpdf false 1 %stopped_push
Dictionary stack:
--dict:728/1123(ro)(G)-- --dict:1/20(G)-- --dict:80/200(L)-- --dict:80/200(L)-- --dict:133/256(ro)(G)-- --dict:315/325(ro)(G)-- --dict:29/32(L)--
Current allocation mode is local
GPL Ghostscript 9.27: Unrecoverable error, exit code 1
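
One way to avoid handing such files to Ghostscript at all is to check for encryption first. The sketch below is an assumption-laden heuristic, not what the app does: it relies on encrypted PDFs carrying an /Encrypt entry in their trailer or cross-reference stream dictionary, and a plain byte search can produce false positives if that string appears elsewhere in the file. The function names are invented for illustration.

```python
def looks_encrypted(data):
    """Heuristic: True if the bytes look like a PDF that has an /Encrypt entry."""
    return data.lstrip().startswith(b"%PDF-") and b"/Encrypt" in data


def should_ocr(path):
    """Return False for files that look like encrypted PDFs, True otherwise."""
    # Reads the whole file into memory; fine for a sketch, not for huge files.
    with open(path, "rb") as fh:
        return not looks_encrypted(fh.read())
```

Files that are skipped this way could be logged once instead of producing an error mail on every cron run.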

Index image metadata

Hello

It would be nice to index image metadata, such as description, location, photographer.
This would help to navigate large image collections.
Unfortunately there is no proper metadata management in Nextcloud at all. :-(

This could be done in this app, or in a fork of it.
What are your suggestions?

I do not yet understand how the full-text search machinery works.
Can there be two Content Providers for the same file type, e.g. this app and a new app?

If I fork this app, is there any technical debt or other consideration that I need to look out for?

Similar request: nextcloud/fulltextsearch#413

Nextcloud 20

Hello, is there an update planned for nextcloud 20?

feature request: scan multiple files at once

I am currently indexing all my files, and I can't help but notice that tesseract runs single-threaded.

Would it be possible to spin up a configurable number of tesseract processes to handle n files simultaneously? I have a quad-core processor, but I assume there are people out there with a lot more.
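
A configurable worker count for per-file OCR could look roughly like the sketch below. It is a standalone illustration, not the app's implementation: ocr_many and run_tesseract are hypothetical names, and it assumes the tesseract CLI is on the PATH. Threads are enough here because each job spends its time in an external tesseract process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_tesseract(path):
    """OCR one image with the tesseract CLI and return the recognized text."""
    done = subprocess.run(
        ["tesseract", path, "stdout"],
        check=True,
        capture_output=True,
        text=True,
    )
    return done.stdout


def ocr_many(paths, ocr_one=run_tesseract, workers=2):
    """Apply `ocr_one` to every path with at most `workers` jobs in flight."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its text.
        return dict(zip(paths, pool.map(ocr_one, paths)))
```

Given the thread-contention report earlier in this list, it may be worth combining several workers with a per-process thread limit of 1 so the processes do not compete for the same cores.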

Can't OCR the PDF with image inside it

It works fine for pictures but it does not work for PDFs with Pictures inside.

Ubuntu 18.04
Nextcloud Version: 18.04
Fulltext Search: 1.4.1
Fulltext Elastic Search: 1.5.1
Full text Search Tesseract OCR: 1.4.1
PHP Version: 7.4.5
Memory Limit: 512 MB
Max Execution Time: 3600

Tesseract installed in /usr/share/tesseract-ocr/4.00

Tested with a TIFF File and it works.

Followed the blog post and installed ImageMagick 6.0 and allowed it to read PDF in the policymap.

I get no error when creating the index.

cannot save "Enable OCR" setting

I installed Full text search - Files - Tesseract OCR (BETA) 0.2.0 today. Since then I cannot save the "Enable OCR" setting on the global settings page. This is with Safari on a Mac. The JavaScript console shows

Refused to execute a script because its hash, its nonce, or 'unsafe-inline' does not appear in the script-src directive of the Content Security Policy.

when reloading the settings page. With Chrome, strangely enough, this seems to work, and it even convinces the subsystem to really use tesseract... :-)

What is strange: When OCR is not enabled the two fields below it should be gray but they are not in Safari.... I get the same error with the other settings pages as well in Safari so it might not be related.

Version 20.0.1 Installation Error

Nextcloud displays the following error message during installation:

Client error: GET https://github.com/daita/files_fulltextsearch_tesseract/releases/download/v20.0.1/files_fulltextsearch_tesseract-20.0.1.tar.gz resulted in a 404 Not Found response: Not Found.

Does anyone have an idea?

[Question] Will there be a version suitable for Nextcloud 28 Hub 7?

Will there be a version suitable for Nextcloud 28 Hub 7?

I've been really happy with this app in the past! Kudos to the maintainers 👍👍👍👍👍

So hopefully there will be a version for Nextcloud 28 and upcoming.

Sorry, but I'm not a PHP guy. If there is any testing necessary I will try it.

Best Regards

MF

Support for Nextcloud v19

Updater for NC19 RC3 says:

This app has no update available for NC 19

Can you please have a look?

Thank you

Tesseract OCR on scanned PDFs

Hello,

In Nextcloud it is not possible to index PDF content from scanned documents. The reason is the PDF file format itself: when you scan a document and save it as a PDF, there is no real text layer. So the Nextcloud "Full Text Search" plugin cannot index the content.

With Tesseract it is only possible to find content in images. Hence my question:

Can you add functions and libraries to convert PDFs to images? The process should look like this:

  1. The user uploads a PDF.
  2. This plugin starts working.
  3. The PDF is converted to an image.
  4. Tesseract analyses the content.
  5. Tesseract saves the image to a new PDF (with the same name).

use thiagoalessio\TesseractOCR\TesseractOCR;

// Run OCR on the image and print the recognized text
echo (new TesseractOCR('img.png'))
    ->quiet()
    ->run();

With Tesseract's PDF output enabled, a snippet like this would save the image to a PDF and add a searchable text layer to it.

After this process, users can search all PDFs.
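
The requested PDF-to-image-to-OCR flow can be sketched with two command-line tools. This is a standalone illustration under stated assumptions, not the plugin's code: it assumes pdftoppm (from poppler-utils) and tesseract are installed, the function names are invented, and it produces one searchable PDF per page rather than a merged file with the original name.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, out_prefix, dpi=300):
    """pdftoppm call that renders every page of `pdf` to <out_prefix>-N.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def ocr_to_pdf_cmd(image, out_base, lang="eng"):
    """tesseract call that writes <out_base>.pdf with a searchable text layer."""
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


def make_searchable(pdf, workdir, lang="eng"):
    """Rasterize `pdf` into `workdir`, OCR each page, return the per-page PDFs."""
    workdir = Path(workdir)
    subprocess.run(rasterize_cmd(pdf, workdir / "page"), check=True)
    outputs = []
    for image in sorted(workdir.glob("page-*.png")):
        base = image.with_suffix("")  # e.g. page-01.png -> page-01
        subprocess.run(ocr_to_pdf_cmd(image, base, lang), check=True)
        outputs.append(base.with_suffix(".pdf"))
    return outputs
```

Merging the per-page PDFs back into one document and replacing the upload would be separate steps that this sketch leaves out.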

Error! The command "" was not found.

Hello,

Nextcloud 20.0.4
files_fulltextsearch_tesseract 20.0.1

It hasn't worked for a while (I'm not sure since when). Of course tesseract is installed on my machine, and it used to work very well. I am surprised by the empty command in: Error! The command "" was not found.

Here are my logs:

{"reqId":"EvFIwwPPicpCuO4U0fWn","level":1,"time":"2020-12-17T12:02:29+00:00","remoteAddr":"","user":"--","app":"files_fulltextsearch_tesseract","method":"","url":"--","message":{"Exception":"thiagoalessio\\TesseractOCR\\TesseractNotFoundException","Message":"Error! The command \"\" was not found.\n\nMake sure you have Tesseract OCR installed on your system:\nhttps://github.com/tesseract-ocr/tesseract\n\nThe current $PATH is /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","Code":0,"Trace":[{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/vendor/thiagoalessio/tesseract_ocr/src/TesseractOCR.php","line":26,"function":"checkTesseractPresence","class":"thiagoalessio\\TesseractOCR\\FriendlyErrors","type":"::"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":217,"function":"run","class":"thiagoalessio\\TesseractOCR\\TesseractOCR","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":194,"function":"ocrFileFromPath","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":169,"function":"ocrFile","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":121,"function":"extractContentUsingTesseractOCR","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Listeners/GenericListener.php","line":84,"function":"onFileIndexing","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->"},{"file":"/var/www/nextcloud/lib/private/EventDispatcher/ServiceEventListener.php","line":76,"function":"handle","class":"OCA\\Files_FullTextSearch_Tesseract\\Listen
ers\\GenericListener","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/event-dispatcher/EventDispatcher.php","line":251,"function":"__invoke","class":"OC\\EventDispatcher\\ServiceEventListener","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/event-dispatcher/EventDispatcher.php","line":73,"function":"callListeners","class":"Symfony\\Component\\EventDispatcher\\EventDispatcher","type":"->"},{"file":"/var/www/nextcloud/lib/private/EventDispatcher/EventDispatcher.php","line":86,"function":"dispatch","class":"Symfony\\Component\\EventDispatcher\\EventDispatcher","type":"->"},{"file":"/var/www/nextcloud/lib/private/EventDispatcher/EventDispatcher.php","line":98,"function":"dispatch","class":"OC\\EventDispatcher\\EventDispatcher","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/ExtensionService.php","line":123,"function":"dispatchTyped","class":"OC\\EventDispatcher\\EventDispatcher","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/ExtensionService.php","line":90,"function":"dispatch","class":"OCA\\Files_FullTextSearch\\Service\\ExtensionService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php","line":713,"function":"fileIndexing","class":"OCA\\Files_FullTextSearch\\Service\\ExtensionService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php","line":658,"function":"updateContentFromFile","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php","line":573,"function":"updateFilesDocumentFromFile","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->","args":["*** sensitive parameters replaced 
***"]},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php","line":621,"function":"generateDocumentFromIndex","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Provider/FilesProvider.php","line":291,"function":"updateDocument","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/nextcloud/apps/fulltextsearch/lib/Service/IndexService.php","line":407,"function":"updateDocument","class":"OCA\\Files_FullTextSearch\\Provider\\FilesProvider","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/nextcloud/apps/fulltextsearch/lib/Command/Live.php","line":308,"function":"updateDocument","class":"OCA\\FullTextSearch\\Service\\IndexService","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/nextcloud/apps/fulltextsearch/lib/Command/Live.php","line":261,"function":"liveCycle","class":"OCA\\FullTextSearch\\Command\\Live","type":"->"},{"file":"/var/www/nextcloud/apps/mail/vendor/symfony/console/Command/Command.php","line":258,"function":"execute","class":"OCA\\FullTextSearch\\Command\\Live","type":"->"},{"file":"/var/www/nextcloud/core/Command/Base.php","line":169,"function":"run","class":"Symfony\\Component\\Console\\Command\\Command","type":"->"},{"file":"/var/www/nextcloud/apps/mail/vendor/symfony/console/Application.php","line":920,"function":"run","class":"OC\\Core\\Command\\Base","type":"->"},{"file":"/var/www/nextcloud/apps/mail/vendor/symfony/console/Application.php","line":266,"function":"doRunCommand","class":"Symfony\\Component\\Console\\Application","type":"->"},{"file":"/var/www/nextcloud/apps/mail/vendor/symfony/console/Application.php","line":142,"function":"doRun","class":"Symfony\\Component\\Console\\Application","type":"->"},{"file":"/var/www/nextcloud/lib/private/Console/Application.php","line":215,"function":"run
","class":"Symfony\\Component\\Console\\Application","type":"->"},{"file":"/var/www/nextcloud/console.php","line":100,"function":"run","class":"OC\\Console\\Application","type":"->"},{"file":"/var/www/nextcloud/occ","line":11,"args":["/var/www/nextcloud/console.php"],"function":"require_once"}],"File":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/vendor/thiagoalessio/tesseract_ocr/src/FriendlyErrors.php","Line":48,"CustomMessage":"{\"path\":\"\\/DIR\\/nextcloud\\/data\\/DIR\\/USER\\/DIR\\/FILE.jpg\",\"cmd\":\"\\\"tesseract\\\" \\\"\\/DIR\\/data\\/DIR\\/files\\/USER\\/DIR\\/FILE.jpg\\\" \\\"\\/tmp\\/ocrccbv8n\\\" --psm 3 -l fra 2> \\/dev\\/null\",\"lang\":[\"fra\"]}"},"userAgent":"--","version":"20.0.3.2"}

Cannot enable on Nextcloud 21

Hi all,

I am trying to introduce this plugin on a fresh installation of Nextcloud v21.0.2.

As part of my process, I use the app store to download and enable:

  • Full text search
  • Full text search - Elasticsearch Platform
  • Full text search - Files

However, when I try to download and enable "Full text search - Files - Tesseract OCR" I get this error:

Client error: `GET https://github.com/daita/files_fulltextsearch_tesseract/releases/download/v20.0.1/files_fulltextsearch_tesseract-20.0.1.tar.gz` resulted in a `404 Not Found` response: Not Found

I get around this by going into /var/www/html/custom_apps/, downloading the latest version from the GitHub releases using wget, and unpacking it myself. But when I try to enable it, I get the error "An error occurred during the request. Unable to proceed."

Checking the logs I keep getting this:

[PHP] Error: require_once(): Failed opening required '/var/www/html/custom_apps/files_fulltextsearch_tesseract/lib/AppInfo/../../vendor/autoload.php' (include_path='/var/www/html/3rdparty/pear/archive_tar:/var/www/html/3rdparty/pear/console_getopt:/var/www/html/3rdparty/pear/pear-core-minimal/src:/var/www/html/3rdparty/pear/pear_exception:/var/www/html/apps:/var/www/html/custom_apps') at /var/www/html/custom_apps/files_fulltextsearch_tesseract/lib/AppInfo/Application.php#45

I looked at the line mentioned, /var/www/html/custom_apps/files_fulltextsearch_tesseract/lib/AppInfo/Application.php#45. It contains:

require_once __DIR__ . '/../../vendor/autoload.php';

But this seems consistent with other fulltextsearch plugins. So I don't know why this specific app is acting up. Any thoughts?

Exception "time limit exceeded" with occ and PDF files

While the PDF option is disabled, everything is indexed just fine, including OCR of image files. After enabling PDF support and following the instructions on how to configure PHP-imagick, I get this exception when indexing on the CLI with occ, although the PHP time limit is set to unlimited:

php7.2: time limit exceeded `Operation canceled' @ fatal/cache.c/GetImagePixelCache/1795.

Any idea what else could be causing this?

Invalid arguments, undefined offset, disabled for security reasons

Hi,

for new files I suddenly get:

{"reqId":"SkqwZwGpyArStZ61eiLV","level":3,"time":"2020-09-07T11:26:25+00:00","remoteAddr":"","user":"--","app":"PHP","method":"","url":"/occ","message":"join(): Invalid arguments passed at /usr/local/www/nextcloud/apps/files_fulltextsearch_tesseract/vendor/thiagoalessio/tesseract_ocr/src/TesseractOCR.php#18","userAgent":"--","version":"19.0.2.2","id":"5f561b8b25433"}

{"reqId":"SkqwZwGpyArStZ61eiLV","level":3,"time":"2020-09-07T11:26:25+00:00","remoteAddr":"","user":"--","app":"PHP","method":"","url":"/occ","message":"exec() has been disabled for security reasons at /usr/local/www/nextcloud/apps/files_fulltextsearch_tesseract/vendor/thiagoalessio/tesseract_ocr/src/TesseractOCR.php#17","userAgent":"--","version":"19.0.2.2","id":"5f561b8b25459"}

{"reqId":"SkqwZwGpyArStZ61eiLV","level":3,"time":"2020-09-07T11:26:25+00:00","remoteAddr":"","user":"--","app":"PHP","method":"","url":"/occ","message":"Undefined offset: 1 at /usr/local/www/nextcloud/apps/files_fulltextsearch_tesseract/vendor/thiagoalessio/tesseract_ocr/src/Command.php#47","userAgent":"--","version":"19.0.2.2","id":"5f561b8b2547d"}

{"reqId":"SkqwZwGpyArStZ61eiLV","level":3,"time":"2020-09-07T11:26:25+00:00","remoteAddr":"","user":"--","app":"PHP","method":"","url":"/occ","message":"Trying to access array offset on value of type null at /usr/local/www/nextcloud/apps/files_fulltextsearch_tesseract/vendor/thiagoalessio/tesseract_ocr/src/Command.php#47","userAgent":"--","version":"19.0.2.2","id":"5f561b8b254a0"}

Any clue where to look? This appears to be new since my upgrade to NC19...

Parallelize Tesseract

The single-threaded Tesseract causes quite the bottleneck. Can we have an admin setting that allows users to pick the number of parallel processes when running FTS with OCR enabled?

Return value must be type string, bool returned

Value:

OCA/Files_FullTextSearch_Tesseract/
Service/TesseractService::getAbsolutePath()

Could this be related to Elasticsearch 6.8?
All other index generation works.

But how can I debug this?
Thanks a lot

Limit the number of pages for PDFs

As there is no real control over what users store in their Nextcloud, it would be very handy to be able to limit the number of PDF pages that get OCRed.

Stored books with, e.g., more than 500 pages will cause Tesseract to take forever to OCR them.

A "max pages" parameter would be very handy. It could be configurable whether OCR is then limited to that number of pages (e.g. a 500-page document with a max_pages of 20 would still have its first 20 pages OCRed) or whether PDFs with more pages than the limit are ignored entirely.
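
The two behaviours proposed above (OCR only the first pages, or skip oversized PDFs entirely) amount to a small policy function. The sketch below is illustrative only; pages_to_ocr and the policy names "truncate" and "skip" are hypothetical, not settings of the app.

```python
def pages_to_ocr(page_count, max_pages, policy="truncate"):
    """Return the 1-based page numbers to OCR for a PDF with `page_count` pages.

    policy="truncate": OCR only the first `max_pages` pages of a longer PDF.
    policy="skip":     OCR nothing if the PDF exceeds `max_pages`.
    """
    if page_count < 0 or max_pages < 1:
        raise ValueError("page_count must be >= 0 and max_pages >= 1")
    if page_count <= max_pages:
        return range(1, page_count + 1)
    if policy == "truncate":
        return range(1, max_pages + 1)
    if policy == "skip":
        return range(0)
    raise ValueError(f"unknown policy: {policy!r}")
```

With the numbers from the request, a 500-page document and a limit of 20 yields pages 1 to 20 under "truncate" and no pages under "skip".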

occ commands

Hi,
I am about to edit/create a bash file for installing fulltextsearch for Nextcloud, and I want to include tesseract.
I have successfully installed fulltextsearch.
I have installed the Nextcloud apps fulltextsearch, fulltextsearch_elasticsearch, and files_fulltextsearch.
I ran sudo apt-get install -y tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng on my Ubuntu VM and installed the app files_fulltextsearch_tesseract in Nextcloud with the help of my script.
Now I want to enable files_fulltextsearch_tesseract and specify the settings with an occ command like this:
sudo -u www-data php /var/www/nextcloud/occ files_fulltextsearch_tesseract:configure "{\"tesseract_enabled\":\"1\",\"tesseract_lang\":\"eng,deu\"}"
Unfortunately this is not working. The message I get looks like this:
[screenshot]
Then I tried it as suggested in the message:
sudo -u www-data php /var/www/nextcloud/occ files_fulltextsearch:configure "{\"tesseract_enabled\":\"1\",\"tesseract_lang\":\"eng,deu\"}"
It gives me the following in my console:
[screenshot]

I have also had a look at the database. The oc_appconfig table in the nextcloud_db database does not contain any tesseract setting after installing the app via the command line. Three records regarding tesseract are added after enabling the app via the web interface. But I still can't change them on the command line.

So, the questions are: Is there a namespace defined for this app? What is the namespace? How can I enable the app and change the settings from command line / in a bash script?

Thanks for your answer in advance!
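
Independently of any app-specific configure command, occ has generic app:enable and config:app:set commands that write to the app config table. The sketch below only builds those command lines. The key names tesseract_enabled and tesseract_lang are copied from the post above and are not verified against the app, and the occ path and web server user are the ones used in the post.

```python
import shlex


def occ(args, occ_path="/var/www/nextcloud/occ", user="www-data"):
    """Build the argv for one occ call, run as the web server user."""
    return ["sudo", "-u", user, "php", occ_path, *args]


def configure_commands(app_id, settings):
    """Return occ calls that enable `app_id` and set each app config key."""
    cmds = [occ(["app:enable", app_id])]
    for key, value in settings.items():
        cmds.append(occ(["config:app:set", app_id, key, f"--value={value}"]))
    return cmds


if __name__ == "__main__":
    # Key names below are taken from the post, not from the app's source.
    wanted = {"tesseract_enabled": "1", "tesseract_lang": "eng,deu"}
    for cmd in configure_commands("files_fulltextsearch_tesseract", wanted):
        print(shlex.join(cmd))
```

The printed lines can be pasted into a bash script. Comparing the keys with the three records that appear in oc_appconfig after enabling the app in the web interface would confirm the real key names.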

Log is full with errors since update to 20.0.0

Hello, I am using Nextcloud 20.0.3 with the latest update of fulltextsearch_tesseract.
My log is getting spammed with these messages:
{"reqId":"VYwzcnRY1cPYmF6ehKDV","level":3,"time":"2020-12-11T21:45:50+01:00","remoteAddr":"","user":"--","app":"PHP","method":"","url":"--","message":{"Exception":"Error","Message":"Trying to get property 'useFileAsOutput' of non-object at /var/www/nextcloud/apps/files_fulltextsearch_tesseract/vendor/thiagoalessio/tesseract_ocr/src/TesseractOCR.php#42","Code":0,"Trace":[{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/vendor/thiagoalessio/tesseract_ocr/src/TesseractOCR.php","line":42,"function":"onError","class":"OC\\Log\\ErrorHandler","type":"::"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":217,"function":"run","class":"thiagoalessio\\TesseractOCR\\TesseractOCR","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":194,"function":"ocrFileFromPath","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":169,"function":"ocrFile","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":121,"function":"extractContentUsingTesseractOCR","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Listeners/GenericListener.php","line":84,"function":"onFileIndexing","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->"},{"file":"/var/www/nextcloud/lib/private/EventDispatcher/ServiceEventListener.php","line":76,"function":"handle","class":"OCA\\Files_FullTextSearch_Tesseract\\Listeners\\GenericListener","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/event-dispatcher/EventDispatcher.php","line":251,"function":"__invoke","class":"OC\\EventDis
patcher\\ServiceEventListener","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/event-dispatcher/EventDispatcher.php","line":73,"function":"callListeners","class":"Symfony\\Component\\EventDispatcher\\EventDispatcher","type":"->"},{"file":"/var/www/nextcloud/lib/private/EventDispatcher/EventDispatcher.php","line":86,"function":"dispatch","class":"Symfony\\Component\\EventDispatcher\\EventDispatcher","type":"->"},{"file":"/var/www/nextcloud/lib/private/EventDispatcher/EventDispatcher.php","line":98,"function":"dispatch","class":"OC\\EventDispatcher\\EventDispatcher","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/ExtensionService.php","line":123,"function":"dispatchTyped","class":"OC\\EventDispatcher\\EventDispatcher","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/ExtensionService.php","line":90,"function":"dispatch","class":"OCA\\Files_FullTextSearch\\Service\\ExtensionService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php","line":713,"function":"fileIndexing","class":"OCA\\Files_FullTextSearch\\Service\\ExtensionService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php","line":658,"function":"updateContentFromFile","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php","line":573,"function":"updateFilesDocumentFromFile","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->","args":["*** sensitive parameters replaced 
***"]},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php","line":621,"function":"generateDocumentFromIndex","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Provider/FilesProvider.php","line":291,"function":"updateDocument","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/nextcloud/apps/fulltextsearch/lib/Service/IndexService.php","line":407,"function":"updateDocument","class":"OCA\\Files_FullTextSearch\\Provider\\FilesProvider","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/nextcloud/apps/fulltextsearch/lib/Cron/Index.php","line":134,"function":"updateDocument","class":"OCA\\FullTextSearch\\Service\\IndexService","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/nextcloud/apps/fulltextsearch/lib/Cron/Index.php","line":102,"function":"liveCycle","class":"OCA\\FullTextSearch\\Cron\\Index","type":"->"},{"file":"/var/www/nextcloud/lib/private/BackgroundJob/Job.php","line":52,"function":"run","class":"OCA\\FullTextSearch\\Cron\\Index","type":"->"},{"file":"/var/www/nextcloud/lib/private/BackgroundJob/TimedJob.php","line":59,"function":"execute","class":"OC\\BackgroundJob\\Job","type":"->"},{"file":"/var/www/nextcloud/cron.php","line":127,"function":"execute","class":"OC\\BackgroundJob\\TimedJob","type":"->"}],"File":"/var/www/nextcloud/lib/private/Log/ErrorHandler.php","Line":91,"CustomMessage":"--"},"userAgent":"--","version":"20.0.3.2","id":"5fd46f08efa2c"}

{"reqId":"VYwzcnRY1cPYmF6ehKDV","level":3,"time":"2020-12-11T21:45:50+01:00","remoteAddr":"","user":"--","app":"PHP","method":"","url":"--","message":{"Exception":"Error","Message":"Trying to get property 'executable' of non-object at /var/www/nextcloud/apps/files_fulltextsearch_tesseract/vendor/thiagoalessio/tesseract_ocr/src/TesseractOCR.php#26","Code":0,"Trace":[{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/vendor/thiagoalessio/tesseract_ocr/src/TesseractOCR.php","line":26,"function":"onError","class":"OC\\Log\\ErrorHandler","type":"::"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":217,"function":"run","class":"thiagoalessio\\TesseractOCR\\TesseractOCR","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":194,"function":"ocrFileFromPath","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":169,"function":"ocrFile","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":121,"function":"extractContentUsingTesseractOCR","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch_tesseract/lib/Listeners/GenericListener.php","line":84,"function":"onFileIndexing","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->"},{"file":"/var/www/nextcloud/lib/private/EventDispatcher/ServiceEventListener.php","line":76,"function":"handle","class":"OCA\\Files_FullTextSearch_Tesseract\\Listeners\\GenericListener","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/event-dispatcher/EventDispatcher.php","line":251,"function":"__invoke","class":"OC\\EventDispatch
er\\ServiceEventListener","type":"->"},{"file":"/var/www/nextcloud/3rdparty/symfony/event-dispatcher/EventDispatcher.php","line":73,"function":"callListeners","class":"Symfony\\Component\\EventDispatcher\\EventDispatcher","type":"->"},{"file":"/var/www/nextcloud/lib/private/EventDispatcher/EventDispatcher.php","line":86,"function":"dispatch","class":"Symfony\\Component\\EventDispatcher\\EventDispatcher","type":"->"},{"file":"/var/www/nextcloud/lib/private/EventDispatcher/EventDispatcher.php","line":98,"function":"dispatch","class":"OC\\EventDispatcher\\EventDispatcher","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/ExtensionService.php","line":123,"function":"dispatchTyped","class":"OC\\EventDispatcher\\EventDispatcher","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/ExtensionService.php","line":90,"function":"dispatch","class":"OCA\\Files_FullTextSearch\\Service\\ExtensionService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php","line":713,"function":"fileIndexing","class":"OCA\\Files_FullTextSearch\\Service\\ExtensionService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php","line":658,"function":"updateContentFromFile","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php","line":573,"function":"updateFilesDocumentFromFile","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->","args":["*** sensitive parameters replaced 
***"]},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php","line":621,"function":"generateDocumentFromIndex","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->"},{"file":"/var/www/nextcloud/apps/files_fulltextsearch/lib/Provider/FilesProvider.php","line":291,"function":"updateDocument","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/nextcloud/apps/fulltextsearch/lib/Service/IndexService.php","line":407,"function":"updateDocument","class":"OCA\\Files_FullTextSearch\\Provider\\FilesProvider","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/nextcloud/apps/fulltextsearch/lib/Cron/Index.php","line":134,"function":"updateDocument","class":"OCA\\FullTextSearch\\Service\\IndexService","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/nextcloud/apps/fulltextsearch/lib/Cron/Index.php","line":102,"function":"liveCycle","class":"OCA\\FullTextSearch\\Cron\\Index","type":"->"},{"file":"/var/www/nextcloud/lib/private/BackgroundJob/Job.php","line":52,"function":"run","class":"OCA\\FullTextSearch\\Cron\\Index","type":"->"},{"file":"/var/www/nextcloud/lib/private/BackgroundJob/TimedJob.php","line":59,"function":"execute","class":"OC\\BackgroundJob\\Job","type":"->"},{"file":"/var/www/nextcloud/cron.php","line":127,"function":"execute","class":"OC\\BackgroundJob\\TimedJob","type":"->"}],"File":"/var/www/nextcloud/lib/private/Log/ErrorHandler.php","Line":91,"CustomMessage":"--"},"userAgent":"--","version":"20.0.3.2","id":"5fd46f08efaa8"}

How to "search" files that have been OCRed

Hi,
I have already installed fulltextsearch_tesseract, enabled it in Full Text Search, and copied the language files into tessdata, but how can I tell which images have been indexed? And how can I search the content inside a picture?
Thank you

Can PDFs be "OCRed" with this plugin?

Hello.

I can't find any documentation about which file types are "OCRed" by tesseract-ocr and this plugin. So far, JPG files are automatically processed in my tests, but scanned documents saved as PDF are not. I don't know whether this is a software limitation, a wrong configuration, or a bug.

Are PDFs supposed to be processed by this plugin? If not, will it do so in the future, or is this a limitation of tesseract-ocr?

It would be nice to be able to index existing PDF documents.

Thank you very much.

dependencies

Hey,
please put ghostscript on the dependency list for version 1.2.2. It took me some time to figure out that the ghostscript package is missing in the Nextcloud Docker container.
best, Steffen

Wiki: path of tessdata folder

The wiki says:

copy language files into /usr/share/tessdata/

However the default tessdata folder, at least under Ubuntu, is

/usr/share/tesseract-ocr/tessdata/
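
Because the location differs between distributions, it may be safer to detect the folder than to hard-code it. Below is a minimal Python sketch, not part of the app, that checks the two paths mentioned above plus the `TESSDATA_PREFIX` environment variable that Tesseract consults. The function name `find_tessdata_dir` and the candidate list are assumptions made for illustration.

```python
import os
from pathlib import Path

# Candidate locations taken from this thread; other distributions may differ.
DEFAULT_CANDIDATES = (
    "/usr/share/tesseract-ocr/tessdata",
    "/usr/share/tessdata",
)


def find_tessdata_dir(candidates=DEFAULT_CANDIDATES, env=None):
    """Return the first directory that contains *.traineddata files, or None."""
    env = os.environ if env is None else env
    paths = []
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        # Depending on the Tesseract version, the prefix is either the
        # tessdata directory itself or its parent, so try both.
        paths += [Path(prefix), Path(prefix) / "tessdata"]
    paths += [Path(c) for c in candidates]
    for path in paths:
        if path.is_dir() and any(path.glob("*.traineddata")):
            return path
    return None
```

Running `tesseract --list-langs` on the server is another way to see which language files the installed binary actually finds.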

Admin have not selected any IFullTextSearchPlatform in v 24

This old error again, I guess.

In fact, I've never managed to get tesseract search to function because, despite having it installed, I cannot select it in the "Full text search" options. I also cannot usually view the OCR options below unless I first install Full text search - Files - Tesseract OCR followed by the Full text search apps. Even then it sometimes only appears for a split second. Weird.

Error in nextcloud log
Exception while cronIndex: OCA\FullTextSearch\Exceptions\PlatformNotSelectedException - Admin have not selected any IFullTextSearchPlatform

Test.

sudo -u wwwrun php /srv/www/htdocs/nextcloud/occ fulltextsearch:test

Testing your current setup:
Creating mocked content provider. ok
Testing mocked provider: get indexable documents. (2 items) ok
Loading search platform. fail

In PlatformService.php line 196:
Admin have not selected any IFullTextSearchPlatform

When I could view the OCR options I added several languages...

sudo -u wwwrun php /srv/www/htdocs/nextcloud/occ fulltextsearch:check

Full text search 24.0.0
No Search Platform available

- Content Providers:
Files 24.0.0
{
    "files_local": "1",
    "files_external": "0",
    "files_group_folders": "0",
    "files_encrypted": "0",
    "files_federated": "0",
    "files_size": "20",
    "files_pdf": "1",
    "files_office": "1",
    "files_image": "0",
    "files_audio": "0",
    "files_fulltextsearch_tesseract": {
        "version": "24.0.0",
        "enabled": "1",
        "psm": "4",
        "lang": "eng, chi_sim, chi_tra, jpn",
        "pdf": "1",
        "pdf_limit": "0"
    }
}

Tesseract is installed in /usr/bin/ and the training data is in /usr/share/tessdata/

Edit: in the Admin panel I can add Elasticsearch after installing the app for it, but tesseract is not available.

Smart handling of PDFs

Hi,

Thank you for creating this plugin! I was wondering how the lengthy OCR process could be improved, especially if you have a mix of image-only PDFs and PDFs that were created from e.g. Word documents.

PDF-converted Word documents are indexed just fine even without this plugin. If it is possible to skip OCR for them, that could yield a nice speed improvement. PDFs that contain only images (I like to scan all the snail mail I get) do require OCR before indexing, though.

I do not know whether PDFs are submitted to the index in raw and OCR'ed form (i.e. twice), but even if they are the following proposal might be possible to implement:

  1. Submit PDF without OCRing first
  2. Were terms extracted?
  3. If yes: Mark document as indexed, goto 5
  4. If no: OCR, submit again
  5. Move on to next document

Of course, that process requires "enable the OCR of PDF" to be checked.

If Full-Text Search could implement the above it could save a great amount of time and perhaps also improve the index accuracy because perfectly fine text-containing documents would not have to be OCR'ed before they are indexed.
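
The proposed flow can be sketched in a few lines. This is an illustration of the idea only, not how the app or Full-Text Search currently works: `read_text_layer` and `run_ocr` are hypothetical callables supplied by the caller, and `min_chars` is an assumed threshold standing in for "were terms extracted?".

```python
from typing import Callable, Tuple


def extract_pdf_text(
    pdf_path: str,
    read_text_layer: Callable[[str], str],
    run_ocr: Callable[[str], str],
    min_chars: int = 20,
) -> Tuple[str, bool]:
    """Return (text, used_ocr), trying the embedded text layer before OCR."""
    # Step 1: read the PDF without OCR first.
    text = read_text_layer(pdf_path)
    # Steps 2 and 3: if enough text came out, index it and skip OCR.
    if len(text.strip()) >= min_chars:
        return text, False
    # Step 4: otherwise fall back to the slow OCR path.
    return run_ocr(pdf_path), True
```

A threshold above zero is used because scanned PDFs sometimes carry a few stray characters of embedded text; the right value would need testing on real documents.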

[idea] adding settings to enable/disable image files ocr

Currently, OCR on PDFs is optional. It would be nice if OCR on JPEGs were optional as well.

Why: to be consistent and let users choose which file types are OCRed.

My use case: some users tend to upload a lot of photos, which take up significant resources to OCR. Scanned documents are fewer in number than photos, but they are arguably more useful to have OCRed and searchable.

Error indexing

Hello 👋🏻

Link to issue:
nextcloud/fulltextsearch#699

When Tesseract is enabled, indexing does not work, and searching in Nextcloud does not work either.

An unhandled exception has been thrown:
Error: Call to a member function getContent() on string in /var/www/html/custom_apps/files_fulltextsearch/lib/Service/FilesService.php:814
Stack trace:
#0 /var/www/html/custom_apps/files_fulltextsearch/lib/Service/FilesService.php(747): OCA\Files_FullTextSearch\Service\FilesService->updateContentFromFile('*** sensitive p...', Object(OC\Files\Node\File))
#1 /var/www/html/custom_apps/files_fulltextsearch/lib/Service/FilesService.php(727): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocumentFromFile(Object(OCA\Files_FullTextSearch\Model\FilesDocument), Object(OC\Files\Node\File))
#2 /var/www/html/custom_apps/files_fulltextsearch/lib/Service/FilesService.php(618): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#3 /var/www/html/custom_apps/files_fulltextsearch/lib/Provider/FilesProvider.php(288): OCA\Files_FullTextSearch\Service\FilesService->generateDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#4 /var/www/html/custom_apps/fulltextsearch/lib/Service/IndexService.php(315): OCA\Files_FullTextSearch\Provider\FilesProvider->fillIndexDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#5 /var/www/html/custom_apps/fulltextsearch/lib/Service/IndexService.php(195): OCA\FullTextSearch\Service\IndexService->indexDocuments(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Array, Object(OCA\FullTextSearch\Model\IndexOptions))
#6 /var/www/html/custom_apps/fulltextsearch/lib/Command/Index.php(416): OCA\FullTextSearch\Service\IndexService->indexProviderContentFromUser(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), 'comacho', Object(OCA\FullTextSearch\Model\IndexOptions))
#7 /var/www/html/custom_apps/fulltextsearch/lib/Command/Index.php(279): OCA\FullTextSearch\Command\Index->indexProvider(Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Object(OCA\FullTextSearch\Model\IndexOptions))
#8 /var/www/html/3rdparty/symfony/console/Command/Command.php(255): OCA\FullTextSearch\Command\Index->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#9 /var/www/html/core/Command/Base.php(168): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#10 /var/www/html/3rdparty/symfony/console/Application.php(1009): OC\Core\Command\Base->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#11 /var/www/html/3rdparty/symfony/console/Application.php(273): Symfony\Component\Console\Application->doRunCommand(Object(OCA\FullTextSearch\Command\Index), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#12 /var/www/html/3rdparty/symfony/console/Application.php(149): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#13 /var/www/html/lib/private/Console/Application.php(211): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#14 /var/www/html/console.php(99): OC\Console\Application->run()
#15 /var/www/html/occ(11): require_once('/var/www/html/c...')
#16 {main}
root@pc04:~# docker exec --user www-data nextcloud-8f54cee5-fb75-9d4d-16b5-5bc0af808fd0 php occ fulltextsearch:index

When it is disabled, it works.

Tried with Tesseract 4 and 5.

Help!
😒

NC 15

Hello,

Can you update this plugin? It's not compatible with NC 15.

tesseract should run with lower priority

Hi!
I have installed the tesseract app 1.4.2 and noticed very high CPU usage with real time priority (20).
I strongly suggest running background jobs at a lower priority to keep the CPU free for time-critical online tasks.
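
One common way to do this on Linux is to start the OCR process through the standard `nice` utility, where a niceness of 19 is the lowest priority. For reference, a `PR` value of 20 in `top` normally corresponds to niceness 0, the default priority. The sketch below is in Python for illustration only; how the app's PHP code launches Tesseract is not shown in this thread, and the helper names are hypothetical.

```python
import subprocess


def low_priority_command(command, niceness=19):
    """Prefix a command with `nice` so it runs at reduced CPU priority."""
    if not 0 <= niceness <= 19:
        raise ValueError("niceness must be between 0 and 19")
    return ["nice", "-n", str(niceness), *command]


def run_low_priority(command, niceness=19, timeout=None):
    """Run the command via `nice`; raises CalledProcessError on failure."""
    return subprocess.run(
        low_priority_command(command, niceness),
        check=True,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

For example, `run_low_priority(["tesseract", "page.png", "stdout"])` would OCR a hypothetical `page.png` while yielding the CPU to interactive requests.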

avoid OCR on non-image PDF files

It appears that this app uses OCR even when the PDF file is not a scanned document.

For example, I have a fresh Nextcloud installation and I see php occ fulltextsearch:index taking a lot of time processing Nextcloud Manual.pdf (a 99-page PDF that comes with Nextcloud) while tesseract is working hard scanning it... That's simply useless.

I would suggest checking whether the PDF contains any text nodes and skipping Tesseract in that case.

smartphone NextCloud App - fulltextsearch returning no results

Hi - I have installed a NextCloud server. The server has all the updates applied and your fulltext search add-on installed. Using a browser and logged into our NextCloud server, I am able to successfully perform fulltext searches on .txt, .docx, .xls and .pdf documents.

To clarify: if I search for the string 'fox' and it is within the content of the mentioned file types (.txt, .docx, .xls and .pdf), I get a successful search.

If I use a smartphone with the NextCloud app installed and use its search feature, I do not get any results.

Is there a way to get this working? Who can I approach? If there is a cost involved, I would still be very interested in going forward.

Thanks Gio

Add Ukrainian localisation

Please add the Ukrainian localisation. Here is the archive with the js and json files; just place them in the l10n folder of the package.
uk.zip

If PDF is enabled, any PDFs that error during /spatie/pdf-to-image conversion will be DELETED & LOST.

PROBLEM:
When PDF is selected in the "Files - Tesseract OCR" options and the Elasticsearch indexing task encounters any PDFs that Ghostscript (used by /spatie/pdf-to-image, which this app uses) considers "bad", those files are deleted and lost during the failed conversion process.

Severity:
Critical if you enable [x] PDF within the app, because you cannot guarantee that users will not upload PDFs that Ghostscript considers "bad". If they do, those files will be deleted and lost during indexing.

More details
In my case I had Tesseract PSM set to 12 and the PDF page limit set to 10, though neither setting should matter here.

The error thrown by Ghostscript during indexing (when PDF is enabled in the Files - Tesseract OCR app) is something like:

**** Error: stream operator isn't terminated by valid EOL. Output may be incorrect.

If you search for this "warning" from Ghostscript, you can see that many people have encountered it over time, which means that many different PDF creation libraries may cause it to occur. In our case, I believe it is caused by whatever NAPS2 (https://github.com/cyanfish/naps2) uses to save as PDF.

Suggested Solution for Files - Tesseract OCR

We cannot assume that pdf-to-image was successful. Preserve the source/input .pdf until it is confirmed that an OCR-scanned PDF of the source file has been generated.
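
One way to guarantee this is to never hand the original file to the converter at all. The report does not show where the deletion actually happens, so the Python sketch below only illustrates the suggested safeguard; `convert_on_copy` and the `convert` callable are hypothetical names, not part of the app or of spatie/pdf-to-image.

```python
import shutil
import tempfile
from pathlib import Path


def convert_on_copy(source, convert):
    """Run `convert` on a temporary copy so the original is never modified.

    `convert` is a hypothetical callable (for example a PDF-to-image plus
    OCR step) that takes a Path and returns its result as data, such as the
    extracted text. It may raise on "bad" input. It should not return paths
    inside the temporary directory, which is removed on exit.
    """
    source = Path(source)
    with tempfile.TemporaryDirectory() as workdir:
        working_copy = Path(workdir) / source.name
        shutil.copy2(source, working_copy)
        # Any failure or cleanup here only affects the temporary copy.
        return convert(working_copy)
```

The cost is one extra copy per file, which seems a reasonable trade for user documents that cannot be recreated.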

"Exception":"ImagickException","Message":"NoDecodeDelegateForThisImageFormat `PDF' @ error/constitute.c/ReadImage/572","Code":420

I just tried to test my playbook that sets up NC20 with files_fulltextsearch / files_fulltextsearch_tesseract. During the task docker exec --user www-data nextcloud php occ fulltextsearch:index I get the following error message. This happens while indexing the file Documents/Nextcloud flyer.pdf

    An unhandled exception has been thrown:
    Error: Call to a member function getContent() on string in /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php:719
    Stack trace:
    #0 /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php(658): OCA\Files_FullTextSearch\Service\FilesService->updateContentFromFile('*** sensitive p...', Object(OC\Files\Node\File))
    #1 /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php(638): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocumentFromFile(Object(OCA\Files_FullTextSearch\Model\FilesDocument), Object(OC\Files\Node\File))
    #2 /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php(529): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
    #3 /var/www/html/apps/files_fulltextsearch/lib/Provider/FilesProvider.php(268): OCA\Files_FullTextSearch\Service\FilesService->generateDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
    #4 /var/www/html/apps/fulltextsearch/lib/Service/IndexService.php(317): OCA\Files_FullTextSearch\Provider\FilesProvider->fillIndexDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
    #5 /var/www/html/apps/fulltextsearch/lib/Service/IndexService.php(204): OCA\FullTextSearch\Service\IndexService->indexDocuments(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Array, Object(OCA\FullTextSearch\Model\IndexOptions))
    #6 /var/www/html/apps/fulltextsearch/lib/Command/Index.php(410): OCA\FullTextSearch\Service\IndexService->indexProviderContentFromUser(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), 'admin', Object(OCA\FullTextSearch\Model\IndexOptions))
    #7 /var/www/html/apps/fulltextsearch/lib/Command/Index.php(273): OCA\FullTextSearch\Command\Index->indexProvider(Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Object(OCA\FullTextSearch\Model\IndexOptions))
    #8 /var/www/html/3rdparty/symfony/console/Command/Command.php(255): OCA\FullTextSearch\Command\Index->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
    #9 /var/www/html/core/Command/Base.php(169): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
    #10 /var/www/html/3rdparty/symfony/console/Application.php(1000): OC\Core\Command\Base->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
    #11 /var/www/html/3rdparty/symfony/console/Application.php(271): Symfony\Component\Console\Application->doRunCommand(Object(OCA\FullTextSearch\Command\Index), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
    #12 /var/www/html/3rdparty/symfony/console/Application.php(147): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
    #13 /var/www/html/lib/private/Console/Application.php(215): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
    #14 /var/www/html/console.php(100): OC\Console\Application->run()
    #15 /var/www/html/occ(11): require_once('/var/www/html/c...')
    #16 {main}
  stdout_lines: <omitted>

In the logs I found the following entry:

{"reqId":"UmZIrh7qxXLxJzh3Bbam","level":1,"time":"2021-01-06T20:42:25+00:00","remoteAddr":"","user":"--","app":"files_fulltextsearch_tesseract","method":"","url":"--","message":{"Exception":"ImagickException","Message":"NoDecodeDelegateForThisImageFormat `PDF' @ error/constitute.c/ReadImage/572","Code":420,"Trace":[{"file":"/var/www/html/apps/files_fulltextsearch_tesseract/vendor/spatie/pdf-to-image/src/Pdf.php","line":40,"function":"pingimage","class":"Imagick","type":"->","args":["/var/nc-data/admin/files/Documents/Nextcloud flyer.pdf"]},{"file":"/var/www/html/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":249,"function":"__construct","class":"Spatie\\PdfToImage\\Pdf","type":"->","args":["/var/nc-data/admin/files/Documents/Nextcloud flyer.pdf"]},{"file":"/var/www/html/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":165,"function":"ocrPdf","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->","args":["*** sensitive parameter replaced ***","*** sensitive parameter replaced ***"]},{"file":"/var/www/html/apps/files_fulltextsearch_tesseract/lib/Service/TesseractService.php","line":121,"function":"extractContentUsingTesseractOCR","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->","args":["*** sensitive parameter replaced ***","*** sensitive parameter replaced 
***"]},{"file":"/var/www/html/apps/files_fulltextsearch_tesseract/lib/Listeners/GenericListener.php","line":84,"function":"onFileIndexing","class":"OCA\\Files_FullTextSearch_Tesseract\\Service\\TesseractService","type":"->","args":[{"__class__":"OCP\\EventDispatcher\\GenericEvent"}]},{"file":"/var/www/html/lib/private/EventDispatcher/ServiceEventListener.php","line":76,"function":"handle","class":"OCA\\Files_FullTextSearch_Tesseract\\Listeners\\GenericListener","type":"->","args":[{"__class__":"OCP\\EventDispatcher\\GenericEvent"}]},{"file":"/var/www/html/3rdparty/symfony/event-dispatcher/EventDispatcher.php","line":251,"function":"__invoke","class":"OC\\EventDispatcher\\ServiceEventListener","type":"->","args":[{"__class__":"OCP\\EventDispatcher\\GenericEvent"},"OCP\\EventDispatcher\\GenericEvent",{"__class__":"Symfony\\Component\\EventDispatcher\\EventDispatcher"}]},{"file":"/var/www/html/3rdparty/symfony/event-dispatcher/EventDispatcher.php","line":73,"function":"callListeners","class":"Symfony\\Component\\EventDispatcher\\EventDispatcher","type":"->","args":[[{"__class__":"Closure"}],"OCP\\EventDispatcher\\GenericEvent",{"__class__":"OCP\\EventDispatcher\\GenericEvent"}]},{"file":"/var/www/html/lib/private/EventDispatcher/EventDispatcher.php","line":86,"function":"dispatch","class":"Symfony\\Component\\EventDispatcher\\EventDispatcher","type":"->","args":[{"__class__":"OCP\\EventDispatcher\\GenericEvent"},"OCP\\EventDispatcher\\GenericEvent"]},{"file":"/var/www/html/lib/private/EventDispatcher/EventDispatcher.php","line":98,"function":"dispatch","class":"OC\\EventDispatcher\\EventDispatcher","type":"->","args":["OCP\\EventDispatcher\\GenericEvent",{"__class__":"OCP\\EventDispatcher\\GenericEvent"}]},{"file":"/var/www/html/apps/files_fulltextsearch/lib/Service/ExtensionService.php","line":123,"function":"dispatchTyped","class":"OC\\EventDispatcher\\EventDispatcher","type":"->","args":[{"__class__":"OCP\\EventDispatcher\\GenericEvent"}]},{"file":"/var/www/html/app
s/files_fulltextsearch/lib/Service/ExtensionService.php","line":90,"function":"dispatch","class":"OCA\\Files_FullTextSearch\\Service\\ExtensionService","type":"->","args":["Files_FullTextSearch.onFileIndexing",{"file":"*** sensitive parameter replaced ***","document":"*** sensitive parameter replaced ***"}]},{"file":"/var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php","line":713,"function":"fileIndexing","class":"OCA\\Files_FullTextSearch\\Service\\ExtensionService","type":"->","args":["*** sensitive parameter replaced ***","*** sensitive parameter replaced ***"]},{"file":"/var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php","line":658,"function":"updateContentFromFile","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php","line":638,"function":"updateFilesDocumentFromFile","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php","line":529,"function":"updateFilesDocument","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->","args":["*** sensitive parameters replaced ***"]},{"file":"/var/www/html/apps/files_fulltextsearch/lib/Provider/FilesProvider.php","line":268,"function":"generateDocument","class":"OCA\\Files_FullTextSearch\\Service\\FilesService","type":"->","args":["*** sensitive parameter replaced ***"]},{"file":"/var/www/html/apps/fulltextsearch/lib/Service/IndexService.php","line":317,"function":"fillIndexDocument","class":"OCA\\Files_FullTextSearch\\Provider\\FilesProvider","type":"->","args":["*** sensitive parameter replaced 
***"]},{"file":"/var/www/html/apps/fulltextsearch/lib/Service/IndexService.php","line":204,"function":"indexDocuments","class":"OCA\\FullTextSearch\\Service\\IndexService","type":"->","args":[{"__class__":"OCA\\FullTextSearch_Elasticsearch\\Platform\\ElasticSearchPlatform"},{"__class__":"OCA\\Files_FullTextSearch\\Provider\\FilesProvider"},[],{"__class__":"OCA\\FullTextSearch\\Model\\IndexOptions"}]},{"file":"/var/www/html/apps/fulltextsearch/lib/Command/Index.php","line":410,"function":"indexProviderContentFromUser","class":"OCA\\FullTextSearch\\Service\\IndexService","type":"->","args":[{"__class__":"OCA\\FullTextSearch_Elasticsearch\\Platform\\ElasticSearchPlatform"},{"__class__":"OCA\\Files_FullTextSearch\\Provider\\FilesProvider"},"admin",{"__class__":"OCA\\FullTextSearch\\Model\\IndexOptions"}]},{"file":"/var/www/html/apps/fulltextsearch/lib/Command/Index.php","line":273,"function":"indexProvider","class":"OCA\\FullTextSearch\\Command\\Index","type":"->","args":[{"__class__":"OCA\\Files_FullTextSearch\\Provider\\FilesProvider"},{"__class__":"OCA\\FullTextSearch\\Model\\IndexOptions"}]},{"file":"/var/www/html/3rdparty/symfony/console/Command/Command.php","line":255,"function":"execute","class":"OCA\\FullTextSearch\\Command\\Index","type":"->","args":[{"__class__":"Symfony\\Component\\Console\\Input\\ArgvInput"},{"__class__":"Symfony\\Component\\Console\\Output\\ConsoleOutput"}]},{"file":"/var/www/html/core/Command/Base.php","line":169,"function":"run","class":"Symfony\\Component\\Console\\Command\\Command","type":"->","args":[{"__class__":"Symfony\\Component\\Console\\Input\\ArgvInput"},{"__class__":"Symfony\\Component\\Console\\Output\\ConsoleOutput"}]},{"file":"/var/www/html/3rdparty/symfony/console/Application.php","line":1000,"function":"run","class":"OC\\Core\\Command\\Base","type":"->","args":[{"__class__":"Symfony\\Component\\Console\\Input\\ArgvInput"},{"__class__":"Symfony\\Component\\Console\\Output\\ConsoleOutput"}]},{"file":"/var/www/html/3rdparty/s
ymfony/console/Application.php","line":271,"function":"doRunCommand","class":"Symfony\\Component\\Console\\Application","type":"->","args":[{"__class__":"OCA\\FullTextSearch\\Command\\Index"},{"__class__":"Symfony\\Component\\Console\\Input\\ArgvInput"},{"__class__":"Symfony\\Component\\Console\\Output\\ConsoleOutput"}]},{"file":"/var/www/html/3rdparty/symfony/console/Application.php","line":147,"function":"doRun","class":"Symfony\\Component\\Console\\Application","type":"->","args":[{"__class__":"Symfony\\Component\\Console\\Input\\ArgvInput"},{"__class__":"Symfony\\Component\\Console\\Output\\ConsoleOutput"}]},{"file":"/var/www/html/lib/private/Console/Application.php","line":215,"function":"run","class":"Symfony\\Component\\Console\\Application","type":"->","args":[{"__class__":"Symfony\\Component\\Console\\Input\\ArgvInput"},{"__class__":"Symfony\\Component\\Console\\Output\\ConsoleOutput"}]},{"file":"/var/www/html/console.php","line":100,"function":"run","class":"OC\\Console\\Application","type":"->","args":[]},{"file":"/var/www/html/occ","line":11,"args":["/var/www/html/console.php"],"function":"require_once"}],"File":"/var/www/html/apps/files_fulltextsearch_tesseract/vendor/spatie/pdf-to-image/src/Pdf.php","Line":40,"CustomMessage":"{\"document\":{\"id\":\"19\",\"providerId\":\"files\",\"access\":{\"ownerId\":\"admin\",\"viewerId\":\"\",\"users\":[],\"groups\":[],\"circles\":[],\"links\":[]},\"modifiedTime\":1609965688,\"title\":\"Documents\\/Nextcloud flyer.pdf\",\"link\":\"\",\"index\":{\"ownerId\":\"\",\"providerId\":\"files\",\"source\":\"files_local\",\"documentId\":\"19\",\"lastIndex\":1609965745,\"errors\":[],\"errorCount\":0,\"status\":28,\"options\":{\"_files_pdf\":\"1\",\"_files_local\":\"1\"}},\"source\":\"files_local\",\"info\":{\"share_names\":{\"admin\":\"Documents\\/Nextcloud 
flyer.pdf\"}},\"hash\":\"\",\"contentSize\":498680,\"tags\":[],\"metatags\":[],\"subtags\":[],\"more\":[],\"excerpts\":[],\"score\":\"\"}}"},"userAgent":"--","version":"20.0.4.0"}

This happens with elasticsearch:7.6.2 and 7.9.3.

BTW: elasticsearch:7.10.1 doesn't build at all.

Support for Nextcloud 25

Nextcloud 25 has been out for a while and the fulltextsearch apps are currently blocking my update ...

PDF Image Extraction does not auto-rotate landscape pages

This may be a problem with Tesseract itself, or something that could be handled by an option when creating the OCR instance; I am honestly not sure where the issue is best addressed. In PDF documents that contain scanned images, if the image rotation is incorrect (for English, the text should read left to right), OCR does not work and the document is therefore not indexed. If I recall correctly, duplexing and auto-rotating pages is usually a function of any decent scanner software.

I only tested a 90-degree clockwise rotation. Mirrored page scans would probably have the same problem if embedded images rotated 180 degrees (upside down) also fail. So far I have only been able to test by printing PDF images with the Microsoft PDF printer, which does not auto-rotate the images, and with Bullzip, which corrects the rotation of pages. I am eager to test with my office scanner to see whether rotation is handled well in the Canon scanning process.

If for no other reason, I am posting this to help educate others on what is acceptable input for OCR. If it had been a sideways text object, I suspect it would have worked, but for embedded images it simply does not work if the text is not displayed LTR.

Feature request: .noocr analog to .noindex?

I have a large photo library which I do not want to be OCR'd. Would be great if I could just put a .noocr file into the directory and then fulltextsearch would index the dir but not use OCR.
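
A marker-file check like this is simple to express. The Python sketch below is only an illustration of the requested behaviour, not existing app code; it assumes the marker should also apply to subdirectories, which the request does not state, and the function name `ocr_allowed` is hypothetical.

```python
from pathlib import Path


def ocr_allowed(file_path, root, marker=".noocr"):
    """Return False if the file's directory, or any parent up to root, holds the marker."""
    root = Path(root).resolve()
    directory = Path(file_path).resolve().parent
    while True:
        if (directory / marker).exists():
            return False
        # Stop at the storage root, or at the filesystem root as a safety net.
        if directory == root or directory == directory.parent:
            return True
        directory = directory.parent
```

The file would still be indexed by name and metadata; only the OCR step would be skipped when this returns False.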
