jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents

Home Page: https://paperless-ng.readthedocs.io/en/latest/

License: GNU General Public License v3.0

CSS 0.16% HTML 11.39% JavaScript 0.16% Shell 1.31% Python 53.46% Dockerfile 0.25% PostScript 12.42% TypeScript 19.01% SCSS 1.86%

dms document-management-system full-text-search machine-learning django angular ocr archiving search


paperless-ng's Issues

Correspondent / Type / Tag management pages: Sorting and filtering

The lists get unmanageable otherwise with >100 entries.

Warning when closing documents/paperless when documents have unsaved edits.

Require the user to confirm closing a document / closing paperless when there are unsaved edits.

This is certainly nice to have. However, you usually don't do a whole lot of editing of the documents, so even if you forget to save, you won't lose all that much time.

Notable things/possible improvements, coming from Docspell

Hey Jonas, this absolutely looks amazing! I would have loved to see this half a year ago, especially the dashboard idea seems to be great!

I am kind of torn between Docspell and paperless-ng now, having just added all my files to Docspell, but really like your UI!

I was wondering what you think of the following things, which I think Docspell offers better:

Tags, Correspondent, etc shown in overview
I would say this is a must for me, not only to filter using them but also to see the attached metadata in search results and so on. Having this also makes it sensible to offer only an edit view and no separate details view. Right now I have to open edit mode to see metadata.
Okay, just figured out (see last point) that tags are shown really well and the correspondent is made a prefix of the title, which I like less (looks weird when the title also uses colon).
Plus it would be nice if those properties were clickable
Show document in the browser
Right now the document can only be seen in edit mode. Clicking on the preview could, for instance, lead to the document itself (I think this is something the original Paperless offers as well).
Concerned person
I really, really like that I can add the information to whom a letter had been addressed: me, my wife, both of us, which of the children.
nested tags or tag categories
This just helps a lot with organizing. Right now one has to work around it by using some concerned-me, concerned-kid1, concerned-wife logic.
Direction of item
I guess most people mainly use this type of app for letters in a broader sense, and sometimes I just find it really useful to also throw in the things I send out, e.g. some application for public funds or something like that.
Syntax help
Right now it only suggests the property, not the operator from what I saw. I loved the expert search from the original paperless!
Corresponding person (detail of corresponding organization)
I have to admit that I use this not that often so far, but I like the idea :-)
User-specific filename
Is this still possible like in original paperless?
MariaDB/MySQL support
is this planned?
Multi-user support as already mentioned in #52
multiple files per item
I like the idea to sometimes put things together, which belong together. An alternative could be to add relations between items!

And please add a check if leaving edit mode with unsaved changes!

Remove File type checks from the backend.

Don't have a limited selection of file types on the Document model. Don't check for file types when uploading new documents.

Rather, check the validity of a file type by checking for available parsers of that type.
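To illustrate the idea, here is a minimal sketch in which a file type counts as valid exactly when some parser is registered for its MIME type. The `PARSERS` registry and both function names are hypothetical and are not taken from the paperless-ng code base.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry: MIME type -> parser factory (placeholders here).
PARSERS: Dict[str, Callable[[], str]] = {
    "application/pdf": lambda: "pdf-parser",
    "text/plain": lambda: "text-parser",
}


def get_parser_for(mime_type: str) -> Optional[Callable[[], str]]:
    """Return the parser factory registered for mime_type, or None."""
    return PARSERS.get(mime_type)


def is_supported(mime_type: str) -> bool:
    """A file type is accepted only if a parser claims it."""
    return get_parser_for(mime_type) is not None
```

With this approach, adding support for a new file type only means registering another parser; no separate list of allowed types has to be kept in sync.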

document_exporter: overwrites exported files if filenames clash

When documents are exported with document_exporter, and two documents with the same generated filename exist in the database, the exporter will only export one of these files and references it in multiple documents in the manifest.
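One possible way to avoid the clash is to append a counter whenever a generated filename has already been used during the export run. This is only a sketch; `unique_filename` and the `_01` suffix format are assumptions, not the exporter's actual behaviour.

```python
import os
from typing import Set


def unique_filename(candidate: str, used: Set[str]) -> str:
    """Return candidate, or a suffixed variant if it was already used."""
    base, ext = os.path.splitext(candidate)
    name = candidate
    counter = 1
    while name in used:
        # Try invoice_01.pdf, invoice_02.pdf, ... until a free name is found.
        name = f"{base}_{counter:02}{ext}"
        counter += 1
    used.add(name)
    return name
```

Calling it three times with `"invoice.pdf"` and the same `used` set yields `invoice.pdf`, `invoice_01.pdf` and `invoice_02.pdf`, so every manifest entry can point at its own file.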

Django Q with ORM broker?

I was wondering if you decided to use redis as default broker because you had issues with the Django ORM broker?

I would have guessed that the ORM broker is good enough for the majority of paperless setups and it would remove the need to run redis.

Permanent data (such as saved views) should be saved on the server.

Nested tags

Ability to have nested tags, for example

bank accounts
- bank 1
  - account 1
  - account 2
- bank 2
  - account 3

This would allow easy filtering by multiple tags. Filter by bankaccount, and see documents of all accounts. Filter by account 2, and only documents from that account show up.
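The filtering behaviour described here can be sketched with a simple parent map: filtering by a tag matches documents carrying that tag or any tag nested below it. The `PARENT` mapping and the function names are hypothetical and only mirror the example tree above.

```python
from typing import Dict, List, Optional, Set

# Hypothetical tag tree: tag name -> parent tag name (None for a root tag).
PARENT: Dict[str, Optional[str]] = {
    "bank accounts": None,
    "bank 1": "bank accounts",
    "account 1": "bank 1",
    "account 2": "bank 1",
    "bank 2": "bank accounts",
    "account 3": "bank 2",
}


def descendants(tag: str) -> Set[str]:
    """Return the tag itself plus every tag nested below it."""
    result = {tag}
    for child, parent in PARENT.items():
        if parent == tag:
            result |= descendants(child)
    return result


def filter_documents(docs: Dict[str, Set[str]], tag: str) -> List[str]:
    """Return the names of documents tagged with tag or any nested tag."""
    wanted = descendants(tag)
    return sorted(name for name, tags in docs.items() if tags & wanted)
```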

Any input on whether this is a good idea or what else to use it for is greatly appreciated!

Remove GnuPG.

As stated in the documentation, it provides no security at all, since:

key is stored along with encrypted documents
Paperless provides transparent access to encrypted documents
Plain text information is stored in the database, including complete document contents, which contains all sensitive information this encryption was supposed to protect.

Removing this feature will decrease the code complexity in many places.

Correspondent / Type / Tag selectors only show up to 25 entries

Since they are using the same paginated API.
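A selector could work around this by following the pagination links until all entries are loaded. The sketch below assumes the API returns a page envelope with `results` and `next` keys, as Django REST Framework's page-number pagination does; `get_json` is a hypothetical callable that performs the HTTP request and returns the decoded body.

```python
from typing import Callable, Dict, List, Optional


def fetch_all(first_url: str, get_json: Callable[[str], Dict]) -> List[Dict]:
    """Collect the results of every page by following the `next` links."""
    results: List[Dict] = []
    url: Optional[str] = first_url
    while url:
        page = get_json(url)
        results.extend(page["results"])
        url = page.get("next")  # None on the last page ends the loop
    return results
```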

Docker Image: use a more recent version of binary dependencies.

Paperless-ng uses tesseract 4.0.0. It gets the job done, but a more recent version would be nice. Same goes for ghostscript, magick, etc.

The issue is as follows.

Paperless-ng requires numpy.
In order to install paperless on the Raspberry Pi within a reasonable time (I've heard people use this on the Pi), numpy needs to be pulled from https://www.piwheels.org/. For this to work, python-3.7 is required, since there is no python-3.8 wheel for numpy on ARM. Building the wheel on Pi requires several additional dependencies and many hours of compilation time.
Therefore, a base image with python-3.7 has to be used, which will also imply that older versions of other binaries get installed.

Any pointers on how to fix this situation while still using python 3.7 are appreciated.

Make an Android app that allows you to share any documents on your mobile with paperless.

Essentially allowing consumption with any of these mobile scanner apps.

Update the documentation.

If this project gets a lot of attention, I'll have to update the documentation. Many things changed and are not valid anymore.

Automatic tag and correspondent colors

Inspired by what for example Grafana does (and due to the limited set of colors available in paperless), I patched it to auto-color tags and correspondents with a https://github.com/zenozeng/color-hash like implementation:

I understand that people might like choosing the colors themselves, so feel free to close this. 😄 It's just that I don't care about the actual color of tags, I just want them to have different ones and I don't want to be bothered with selecting them.
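For readers unfamiliar with the technique, a hash-based color assignment can be sketched as follows. This is neither the patch mentioned above nor the algorithm of the color-hash library; it is an independent illustration, and the fixed lightness and saturation values are arbitrary assumptions.

```python
import colorsys
import hashlib


def color_for(name: str) -> str:
    """Derive a stable hex color from a tag or correspondent name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Map the first two digest bytes onto the hue circle: 0.0 <= hue < 1.0.
    hue = int.from_bytes(digest[:2], "big") / 65536
    # Lightness 0.5 and saturation 0.65 are arbitrary illustrative choices.
    r, g, b = colorsys.hls_to_rgb(hue, 0.5, 0.65)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
```

The same name always maps to the same color, so nothing has to be stored, while different names usually land on different hues.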

The change_storage_type script is entirely busted.

It just does not work with the new file handling logic that came into paperless earlier this year.

It will crash and leave your document archive in an intermediate state, from which no automatic recovery is possible.

Delay consumption of new files

I might be a little too nervous but I'm just too excited. I just tested direct scanning into the consumption directory, which leads to the following warning in the log:

11/25/20, 10:02 PM WARNING Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!
11/25/20, 10:02 PM INFO Consuming doc20201125222904.pdf

The stdout log of the webserver container shows the following:
22:02:09 [Q] INFO Enqueued 1
22:02:09 [Q] INFO Process-1:1 processing [doc20201125222904.pdf]
Consuming doc20201125222904.pdf
Parser: RasterisedDocumentParser based on mime type application/pdf
Generating thumbnail for doc20201125222904.pdf...
Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -trim /usr/src/paperless/src/../consume/doc20201125222904.pdf[0] /tmp/paperless/paperless-3vtrk65p/convert.png
**** Error: Cannot find a 'startxref' anywhere in the file.
Output may be incorrect.
**** Error: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** However, the output may be incorrect.
**** Error: Trailer dictionary not found.
Output may be incorrect.

Requested FirstPage is greater than the number of pages in the file: 0
No pages will be processed (FirstPage > LastPage).
convert-im6.q16: no images defined `/tmp/paperless/paperless-3vtrk65p/convert.png' @ error/convert.c/ConvertImageCommand/3258.
Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!
**** Error: Cannot find a 'startxref' anywhere in the file.
Output may be incorrect.
**** Error: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** However, the output may be incorrect.
**** Error: Trailer dictionary not found.
Output may be incorrect.
No pages will be processed (FirstPage > LastPage).
Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -trim /tmp/paperless/paperless-3vtrk65p/gs_out.png /tmp/paperless/paperless-3vtrk65p/convert.png
convert-im6.q16: unable to open image `/tmp/paperless/paperless-3vtrk65p/gs_out.png': No such file or directory @ error/blob.c/OpenBlob/2874.
convert-im6.q16: no images defined `/tmp/paperless/paperless-3vtrk65p/convert.png' @ error/convert.c/ConvertImageCommand/3258.
Deleting directory /tmp/paperless/paperless-3vtrk65p
22:02:09 [Q] ERROR Failed [doc20201125222904.pdf] - Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/tmp/paperless/paperless-3vtrk65p/gs_out.png', '/tmp/paperless/paperless-3vtrk65p/convert.png'] : Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 49, in get_thumbnail
logging_group=self.logging_group)
File "/usr/src/paperless/src/documents/parsers.py", line 107, in run_convert
raise ParseError("Convert failed at {}".format(args))
documents.parsers.ParseError: Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/usr/src/paperless/src/../consume/doc20201125222904.pdf[0]', '/tmp/paperless/paperless-3vtrk65p/convert.png']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/src/paperless/src/documents/consumer.py", line 132, in try_consume_file
thumbnail = document_parser.get_optimised_thumbnail()
File "/usr/src/paperless/src/documents/parsers.py", line 168, in get_optimised_thumbnail
return self.optimise_thumbnail(self.get_thumbnail())
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 73, in get_thumbnail
logging_group=self.logging_group)
File "/usr/src/paperless/src/documents/parsers.py", line 107, in run_convert
raise ParseError("Convert failed at {}".format(args))
documents.parsers.ParseError: Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/tmp/paperless/paperless-3vtrk65p/gs_out.png', '/tmp/paperless/paperless-3vtrk65p/convert.png']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
res = f(*task["args"], **task["kwargs"])
File "/usr/src/paperless/src/documents/tasks.py", line 68, in consume_file
override_tag_ids=override_tag_ids)
File "/usr/src/paperless/src/documents/consumer.py", line 138, in try_consume_file
raise ConsumerError(e)
documents.consumer.ConsumerError: Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/tmp/paperless/paperless-3vtrk65p/gs_out.png', '/tmp/paperless/paperless-3vtrk65p/convert.png']

Is it possible to delay the consumption of new files while they're being written to avoid this?
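One common approach is to poll the file and only hand it to the consumer once its size and modification time have stopped changing. The following is a minimal sketch of that idea; `wait_until_stable` and its default values are hypothetical and not part of paperless-ng.

```python
import os
import time


def wait_until_stable(path: str, interval: float = 1.0, checks: int = 3,
                      timeout: float = 60.0) -> bool:
    """Return True once size and mtime were unchanged for `checks` polls."""
    deadline = time.monotonic() + timeout
    last = None
    stable = 0
    while time.monotonic() < deadline:
        try:
            stat = os.stat(path)
        except FileNotFoundError:
            return False  # the file vanished, e.g. it was renamed or removed
        current = (stat.st_size, stat.st_mtime)
        # Count consecutive polls without change; any change resets the count.
        stable = stable + 1 if current == last else 0
        if stable >= checks:
            return True
        last = current
        time.sleep(interval)
    return False
```

A consumer would call this before parsing and skip or retry the file when it returns False. Polling cannot prove that the writer is finished, so a scanner that pauses longer than `interval * checks` could still trigger a premature read.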

Find and delete duplicates

Hi (again),

One cool feature would be to compare document contents and identify duplicates, maybe upon document consumption and/or in a separate workflow.
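A very simple form of content comparison is to hash the extracted text after normalising whitespace and case, then group documents with identical hashes. The sketch below assumes a mapping from hypothetical document ids to their text content; it only finds exact matches after normalisation, so differing OCR output for the same page would need a fuzzier comparison.

```python
import hashlib
from typing import Dict, List


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and case differences."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_duplicates(docs: Dict[int, str]) -> List[List[int]]:
    """Group document ids whose normalised content is identical."""
    groups: Dict[str, List[int]] = {}
    for doc_id, text in docs.items():
        groups.setdefault(content_fingerprint(text), []).append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```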

Once again, great kudos for the work, keep that up!

Document consumer won't see renamed files.

Some scanners first write a file under a temporary name such as "something.pd~", which the consumer does not pick up, and then rename it to its final name. The consumer does not notice the rename, so these files are never consumed.

Also: The consumer should expect the files to have moved before consuming. Implement checks.

Preserve changes to opened documents

When opening a document, switching between any other page and the document detail page will revert any unsaved changes.

Sidebar buttons to close documents

Close all seems really useful.
Maybe limit the documents to 10 and sort them by most recently edited.

Consumer should not assign parsed dates from the future.
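A minimal sketch of such a check, assuming the parser yields candidate dates in order of preference: pick the first one that is not later than today. The function name and the `today` parameter (useful for testing) are hypothetical.

```python
from datetime import date
from typing import Iterable, Optional


def pick_document_date(candidates: Iterable[date],
                       today: Optional[date] = None) -> Optional[date]:
    """Return the first parsed date that does not lie in the future."""
    today = today or date.today()
    for candidate in candidates:
        if candidate <= today:
            return candidate
    return None  # nothing plausible found; the caller picks a fallback
```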

Clickable tags and correspondents.

Ability to click on tags and correspondents, which applies an instant filter to show only the selected tag/correspondent.

make the frontend usable on mobile.

Several things to do:

menu collapse doesn't work
many layouts are too wide for mobile
login does not scroll

Open documents: Documents should stay open when reloading the page.

Right now they are just gone.

Adjustable page size for the document list.

Display document metadata in the UI

Such as PDF metadata (application, PDF version, author, creator, etc.). Anything that's available.

Classifier Log output shows some scary summary data.

1393 documents, 11 tag(s), 1392 correspondent(s), 1392 document type(s).

investigate.

PAPERLESS_FORCE_SCRIPT_NAME does not work.

This is required to run Paperless on a sub path, i.e. localhost:8000/paperless/.

This is generally required if you wish to use a proxy server with many different services.

Right now, this does not work.

As for the front end, the src/documents/static/index.html contains a <base href="/">, which has to be adapted to href="/paperless" or similar.

paperless-ng/src/documents/templates/index.html

Line 10 in 59bc467

This is easy since the index page is served as a django view and template context is available. See

paperless-ng/src/documents/views.py

Line 67 in 59bc467

class IndexView(TemplateView):

The front end further has some static assets for logos, which are always pointing towards <host>/assets/image.png. This is unsolved right now.

Furthermore, login/logout paths have to respect this path, as well as the admin/ link on the front end.

I don't need this, but if anyone cares about this, feel free to look into it.

Tesseract Parser: Convert fails to convert PDFs with too many pages to images due to resource limits.

Traceback on classifier.predict_tags()

First of all: Thanks for the awesome work and all the effort you put into this!

I wanted to try the classifier with parts of my data and ran into a traceback on consumption

Traceback (most recent call last):
  File "/usr/src/paperless/src/documents/consumer.py", line 171, in try_consume_file
    document_consumption_finished.send(
  File "/usr/local/lib/python3.8/dist-packages/django/dispatch/dispatcher.py", line 177, in send
    return [
  File "/usr/local/lib/python3.8/dist-packages/django/dispatch/dispatcher.py", line 178, in <listcomp>
    (receiver, receiver(signal=self, sender=sender, **named))
  File "/usr/src/paperless/src/documents/signals/handlers.py", line 127, in set_tags
    matched_tags = matching.match_tags(document.content, classifier)
  File "/usr/src/paperless/src/documents/matching.py", line 36, in match_tags
    predicted_tag_ids = classifier.predict_tags(document_content)
  File "/usr/src/paperless/src/documents/classifier.py", line 224, in predict_tags
    tags_ids = self.tags_binarizer.inverse_transform(y)[0]
  File "/usr/local/lib/python3.8/dist-packages/sklearn/preprocessing/_label.py", line 1017, in inverse_transform
    if yt.shape[1] != len(self.classes_):
IndexError: tuple index out of range

I did also see a warning during "create_classifier" run, which I bluntly ignored ofc. 😄

270 documents, 1 tag(s), 1 correspondent(s), 0 document type(s).
Vectorizing data...
Training tags classifier...
/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py:72: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)
Training correspondent classifier...
There are no document types. Not training document type classifier.

Tags from consumer sub directories

I made a small patch to allow me to set tags based on sub directories of the consumer directory (jayme-github/paperless#2) and I would like to know if you would accept that feature to your fork.
Looking at the feature-ocrmypdf branch you switched back to using INotify again, so my patch would also bring back the ability to run a recursive consumer.

Let me know what you think and I'll rebase against whatever branch makes sense.

Preview document in browser

Ability to preview a document in the browser, full screen, without downloading it.

better tags input

Limit tesseract to 1 thread.

Each tesseract process launches multiple threads that do OCR on a single page in parallel. This marginally improves speed.

However: paperless processes multiple pages in parallel (up to cpu count). This results in a massive performance decrease.

Fix by setting OMP_THREAD_LIMIT=1.
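When tesseract is launched as a subprocess, the limit can be applied through the child's environment without affecting the rest of the application. This is a sketch under that assumption; `single_thread_env` and `run_tesseract` are hypothetical helpers, and the arguments passed to the `tesseract` command are left to the caller.

```python
import os
import subprocess
from typing import Dict, List


def single_thread_env() -> Dict[str, str]:
    """Copy the current environment and cap OpenMP at one thread."""
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def run_tesseract(args: List[str]) -> subprocess.CompletedProcess:
    """Run the tesseract CLI with the thread limit applied."""
    return subprocess.run(["tesseract", *args], env=single_thread_env(), check=True)
```

Parallelism then comes only from processing several pages at once, which is the level at which paperless already distributes the work.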

E-Mail consumption: rework and make it actually usable.

don't delete consumed emails.
rather, mark consumed emails as consumed (there are IMAP labels for that.)
allow multiple accounts.
allow filters.
allow documents in the inbox to have the same filename.

Announce the status of the consumer / task queue on the front page

I.e., running / crashed.

This is easy since we're using supervisord, and there's an RPC API for getting the status.

Also see http://supervisord.org/configuration.html#program-x-section-settings, serverurl for figuring out if we're running with supervisord.
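A status query over supervisord's XML-RPC interface could look roughly like this. The sketch assumes the interface is exposed over HTTP at `http://localhost:9001/RPC2`, which depends on the local supervisord configuration; `summarise` and `fetch_status` are hypothetical names.

```python
from typing import Dict, List
from xmlrpc.client import ServerProxy


def summarise(processes: List[Dict]) -> Dict[str, str]:
    """Map each process name to its state name, e.g. RUNNING or FATAL."""
    return {p["name"]: p["statename"] for p in processes}


def fetch_status(rpc_url: str = "http://localhost:9001/RPC2") -> Dict[str, str]:
    """Ask supervisord for the state of all managed processes."""
    # The URL is an assumption; it must match the configured serverurl.
    with ServerProxy(rpc_url) as proxy:
        return summarise(proxy.supervisor.getAllProcessInfo())
```

The front page could then show the consumer or task queue as running or crashed based on the returned state names.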

Update the Dockerfile to use a more recent version of ubuntu as the base.

python-slim is based on debian buster, which contains the outdated 4.0.0 version of tesseract.

We cannot use alpine, because python-numpy takes ages to install on this image, since it's being compiled from source due to incompatible libraries.

Filter: Should not be able to select multiple filters for the same filter rule.

With one exception: Multiple tag selection should be allowed.

Document filter: Filter by document type not working

The document type selector is not shown

Open documents: Long document titles screw up the layout.

Trigger re-parsing or auto-assignment of consumed documents

Hi there,

first of all - I'm really impressed by the work you did on paperless. I've noticed the issue you opened on the original paperless project and then quickly decided to give it a go. The resource optimizations and feature enhancements you made are quite impressive. Wow!

One thing I'm kind of missing though is - I've now imported a bunch of PDF documents. It all went fine but I went on creating correspondents, document types and tags on-the-fly. Now I've got the most common information in there and I'd like to re-run auto assignment on my existing documents. I didn't (yet) find it in the awesome documentation. Can this somehow be accomplished?

Thank you so much for the great work you're doing with this project!

Cascading filters.

When filtering for a specific correspondent, subsequent filters for tags or types should only display options which are still available.
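The idea can be sketched as follows: after a correspondent has been selected, the tag selector only offers tags that occur on the remaining documents. The document structure (dicts with `correspondent` and `tags` keys) is a hypothetical simplification.

```python
from typing import Dict, List, Set


def available_tags(docs: List[Dict], correspondent: str) -> Set[str]:
    """Return the tags that still occur once the correspondent filter is set."""
    tags: Set[str] = set()
    for doc in docs:
        if doc["correspondent"] == correspondent:
            tags.update(doc["tags"])
    return tags
```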

Support for docx, doc, odf documents.

Integration with OCRmyPDF

Add the ocr'ed text as a text layer to the scanned documents so that text can be copied from them.

Implement proper navigation between documents

When editing documents, the "Next" button should always reflect the next document in the view that was previously visible (i.e. search results, document list, saved view) and should reload data as necessary.

There should also be a previous button

Log scrolling fails

With response: "Enter a valid date/time."

Request URL:

/api/logs/?page=1&page_size=25&ordering=-created&created__lt=2020-11-11T22:13:57.749728+01:00&level__gte=20
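One plausible cause, not confirmed in this issue, is the unencoded `+` in the timezone offset: in a query string a literal `+` is decoded as a space, which leaves an invalid timestamp. The snippet below only demonstrates that decoding behaviour with the standard library.

```python
from urllib.parse import parse_qs, urlencode

timestamp = "2020-11-11T22:13:57.749728+01:00"

# Unencoded, as in the request URL above: "+" is decoded as a space.
raw_query = "created__lt=" + timestamp
decoded_raw = parse_qs(raw_query)["created__lt"][0]

# Percent-encoded: "+" becomes %2B and survives the round trip.
encoded_query = urlencode({"created__lt": timestamp})
decoded_encoded = parse_qs(encoded_query)["created__lt"][0]
```

If this is indeed the cause, encoding the parameter on the client side before building the URL would be enough to fix it.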

Change saved filters

Maybe I misunderstand the idea, but I am kind of stuck once I have saved a view. When I open it again, I cannot change the filters, the sort order or anything else. This is bad from my point of view for two reasons:

You have to start from scratch in order to just adapt a small thing
you cannot use a dashboard for further drilling down into your documents

Idea would be to maybe

just use the normal controls for adding/changing filters and order
add some form of save / save as option

Implement a sanity checker

That

checks if all originals are in place
checks if the checksums match
checks if there are unreferenced files
checks if all thumbnails are there
checks if the mime type of all documents matches

Basically a program that's executed in the background once in a while and tells you that everything is all right.
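A reduced sketch of such a checker, covering three of the checks listed above (missing originals, checksum mismatches, unreferenced files). It assumes a flat directory and a mapping from filename to expected SHA-256 checksum; both the data layout and the hash algorithm are assumptions and may differ from what paperless actually stores.

```python
import hashlib
import os
from typing import Dict, List


def checksum_of(path: str) -> str:
    """Compute the SHA-256 hex digest of a file in 64 KiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_archive(expected: Dict[str, str], directory: str) -> List[str]:
    """Return a list of problems; an empty list means everything is fine."""
    problems = []
    for name, checksum in expected.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            problems.append(f"missing: {name}")
        elif checksum_of(path) != checksum:
            problems.append(f"checksum mismatch: {name}")
    for name in sorted(os.listdir(directory)):
        if name not in expected:
            problems.append(f"unreferenced: {name}")
    return problems
```

Run periodically in the background, an empty result would be the "everything is all right" signal; thumbnail and mime type checks could be added in the same style.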

Text file consumption: Thumbnail generation needs to be fixed.

The code is not working for documents with special characters and the thumbnails look very bad.

Make an update checker

That checks with GitHub if a new release is available.
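A minimal sketch of such a check using GitHub's "latest release" REST endpoint, which returns JSON containing a `tag_name` field. The version parsing simply extracts all numbers from the tag, which is an assumption about how release tags are named; the function names are hypothetical.

```python
import json
import re
from typing import Tuple
from urllib.request import urlopen

RELEASES_URL = "https://api.github.com/repos/jonaswinkler/paperless-ng/releases/latest"


def parse_version(tag: str) -> Tuple[int, ...]:
    """Turn a tag such as 'v1.2.10' into (1, 2, 10) for comparison."""
    return tuple(int(part) for part in re.findall(r"\d+", tag))


def latest_tag(url: str = RELEASES_URL) -> str:
    """Fetch the tag name of the latest release from GitHub."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)["tag_name"]


def update_available(current: str, latest: str) -> bool:
    """Return True if the latest tag denotes a newer version."""
    return parse_version(latest) > parse_version(current)
```

Comparing integer tuples rather than strings matters here: as strings, "0.9.10" would sort before "0.9.2".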