jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents

Home Page: https://paperless-ng.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Languages: CSS 0.16%, HTML 11.39%, JavaScript 0.16%, Shell 1.31%, Python 53.46%, Dockerfile 0.25%, PostScript 12.42%, TypeScript 19.01%, SCSS 1.86%
Topics: dms, document-management-system, full-text-search, machine-learning, django, angular, ocr, archiving, search

paperless-ng's People

Contributors

addadi, ahyear, bastianpoe, belonias, bmsleight, c0nsultant, ckut, dadosch, danielquinn, ddddavidmartin, diveflo, ekw, erikarvstedt, jat255, jenspfeifle, jonaswinkler, maphy-psd, markschmitt, masterofjokers, matthewmoto, muued, ovv, pitkley, puuu, sbrunner, shamoon, strubbl, tido-, tikitu, transifex-integration[bot]

paperless-ng's Issues

Notable things/possible improvements, coming from Docspell

Hey Jonas, this absolutely looks amazing! I would have loved to see this half a year ago, especially the dashboard idea seems to be great!

I am kind of torn between Docspell and paperless-ng now, having just added all my files to Docspell, but really like your UI!

I was wondering what you think of the following things, which I think Docspell offers better:

  1. Tags, Correspondent, etc shown in overview
    I would say this is a must for me: not only to filter by them, but also to see the attached metadata in search results and so on. With this in place, it would also make sense to offer only an edit view and no separate details view. Right now I have to open edit mode to see the metadata.
    Okay, I just figured out (see last point) that tags are shown really well and that the correspondent is made a prefix of the title, which I like less (it looks weird when the title also contains a colon).
    Plus, it would be nice if those properties were clickable.
  2. Show document in the browser
    Right now the document can only be viewed in edit mode. Clicking on the preview could, for instance, lead to the document itself (I think this is something the original Paperless offers as well).
  3. Concerned person
    I really, really like that I can record to whom a letter was addressed: me, my wife, both of us, or one of the children.
  4. nested tags or tag categories
    This just helps a lot with organizing. Right now one has to work around it by using some concerned-me, concerned-kid1, concerned-wife logic.
  5. Direction of item
    I guess most people mainly use this type of app for letters in a broader sense, and sometimes I find it really useful to also throw in the things I send out, e.g. an application for public funds or something like that.
  6. Syntax help
    Right now it only suggests the property, not the operator from what I saw. I loved the expert search from the original paperless!
  7. Corresponding person (detail of corresponding organization)
    I have to admit that I use this not that often so far, but I like the idea :-)
  8. User-specific filename
    Is this still possible like in original paperless?
  9. MariaDB/MySQL support
    is this planned?
  10. Multi-user support as already mentioned in #52
  11. multiple files per item
    I like the idea to sometimes put things together, which belong together. An alternative could be to add relations between items!
  • And please add a check when leaving edit mode with unsaved changes!

Remove File type checks from the backend.

Don't have a limited selection of file types on the Document model. Don't check for file types when uploading new documents.

Rather, check the validity of a file type by checking for available parsers of that type.
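A minimal sketch of what "checking for available parsers" could look like. The registry, the function names and the placeholder parser class are hypothetical and are not taken from the paperless-ng code base:

```python
from typing import Callable, Dict, Optional

# Hypothetical registry: MIME type -> factory that creates a parser object.
PARSERS: Dict[str, Callable[[], object]] = {}


def register_parser(mime_type: str, factory: Callable[[], object]) -> None:
    """Make a parser available for the given MIME type."""
    PARSERS[mime_type] = factory


def get_parser(mime_type: str) -> Optional[object]:
    """Return a new parser for the MIME type, or None if none is registered."""
    factory = PARSERS.get(mime_type)
    return factory() if factory is not None else None


def is_supported(mime_type: str) -> bool:
    """A file type is valid exactly when some parser claims it."""
    return mime_type in PARSERS


class DummyPdfParser:
    """Placeholder standing in for a real parser implementation."""


register_parser("application/pdf", DummyPdfParser)
```

With this shape, supporting a new file type only requires registering a parser; no list of allowed types has to be kept on the model.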

Django Q with ORM broker?

I was wondering if you decided to use redis as default broker because you had issues with the Django ORM broker?

I would have guessed that the ORM broker is good enough for the majority of paperless setups and it would remove the need to run redis.
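For reference, a hedged sketch of how the two broker choices differ in a Django settings file, based on the Q_CLUSTER keys described in the Django Q documentation. The cluster name, worker count and Redis address are illustrative assumptions, not the values paperless-ng uses:

```python
# Variant 1: Redis broker (requires a running Redis server).
Q_CLUSTER_REDIS = {
    "name": "paperless",  # assumed cluster name
    "workers": 2,         # assumed worker count
    "redis": {"host": "127.0.0.1", "port": 6379, "db": 0},
}

# Variant 2: Django ORM broker (uses an existing database connection).
Q_CLUSTER_ORM = {
    "name": "paperless",  # assumed cluster name
    "workers": 2,         # assumed worker count
    "orm": "default",     # alias of the database in DATABASES to use as queue
}

# Only one of the two would be assigned to the real setting.
Q_CLUSTER = Q_CLUSTER_ORM
```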

Nested tags

Ability to have nested tags, for example

  • bank accounts
    • bank 1
      • account 1
      • account 2
    • bank 2
      • account 3

This would allow easy filtering by multiple tags. Filter by bank accounts, and see documents of all accounts. Filter by account 2, and only documents from that account show up.

Any input on whether this is a good idea or what else to use it for is greatly appreciated!
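The filtering behaviour described above can be sketched as follows. This is only an illustration under the assumption that each tag stores the name of its parent; the Tag class and both functions are hypothetical:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Tag:
    name: str
    parent: Optional[str] = None  # name of the parent tag, None for a root tag


def expand(tag_name: str, tags: List[Tag]) -> Set[str]:
    """Return the tag itself plus all of its descendants."""
    result = {tag_name}
    changed = True
    while changed:
        changed = False
        for tag in tags:
            if tag.parent in result and tag.name not in result:
                result.add(tag.name)
                changed = True
    return result


def filter_documents(docs: Dict[str, Set[str]], tag_name: str, tags: List[Tag]) -> List[str]:
    """Return documents carrying the tag or any tag nested below it."""
    wanted = expand(tag_name, tags)
    return sorted(doc for doc, doc_tags in docs.items() if doc_tags & wanted)


TAGS = [
    Tag("bank accounts"),
    Tag("bank 1", "bank accounts"),
    Tag("account 1", "bank 1"),
    Tag("account 2", "bank 1"),
    Tag("bank 2", "bank accounts"),
    Tag("account 3", "bank 2"),
]
DOCS = {"a.pdf": {"account 1"}, "b.pdf": {"account 2"}, "c.pdf": {"account 3"}}
```

Filtering DOCS by "bank accounts" returns all three documents, while filtering by "account 2" returns only b.pdf.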

Remove GnuPG.

As stated in the documentation, it provides no security at all, since:

  • key is stored along with encrypted documents
  • Paperless provides transparent access to encrypted documents
  • Plain text information is stored in the database, including complete document contents, which contains all sensitive information this encryption was supposed to protect.

Removing this feature will decrease the code complexity in many places.

Docker Image: use a more recent version of binary dependencies.

Paperless-ng uses tesseract 4.0.0. It gets the job done, but a more recent version would be nice. Same goes for ghostscript, magick, etc.

The issue is as follows.

  • Paperless-ng requires numpy.
  • In order to install paperless on the Raspberry Pi within a reasonable time (I've heard people use this on the Pi), numpy needs to be pulled from https://www.piwheels.org/. For this to work, python-3.7 is required, since there is no python-3.8 wheel for numpy on ARM. Building the wheel on Pi requires several additional dependencies and many hours of compilation time.
  • Therefore, a base image with python-3.7 has to be used, which will also imply that older versions of other binaries get installed.

Any pointers on how to fix this situation while still using python 3.7 are appreciated.

Update the documentation.

If this project gets a lot of attention, I'll have to update the documentation. Many things have changed, and parts of the documentation are no longer valid.

Automatic tag and correspondent colors

Inspired by what Grafana does, for example (and due to the limited set of colors available in paperless), I patched it to auto-color tags and correspondents with an implementation similar to https://github.com/zenozeng/color-hash.

I understand that people might like choosing the colors themselves, so feel free to close this. 😄 It's just that I don't care about the actual color of tags; I just want them to have different ones, and I don't want to be bothered with selecting them.
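To illustrate the general idea, here is a small sketch that derives a stable color from a name. It is not the patch mentioned above and not the color-hash algorithm itself; the function name and the fixed lightness and saturation values are assumptions:

```python
import colorsys
import hashlib


def color_for(name: str, lightness: float = 0.5, saturation: float = 0.65) -> str:
    """Map a tag or correspondent name to a deterministic hex color."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Use the first two bytes of the hash to pick a hue between 0 and 359.
    hue = int.from_bytes(digest[:2], "big") % 360
    r, g, b = colorsys.hls_to_rgb(hue / 360.0, lightness, saturation)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
```

The same name always yields the same color, so nothing has to be stored. Two different names can still land on similar hues.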

The change_storage_type script is entirely busted.

It just does not work with the new file handling logic that came into paperless earlier this year.

It will crash and leave your document archive in an intermediate state, from which no automatic recovery is possible.

Delay consumption of new files

I might be a little too nervous but I'm just too excited. I just tested direct scanning into the consumption directory, which leads to the following warning in the log:

`11/25/20, 10:02 PM WARNING Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!

11/25/20, 10:02 PM INFO Consuming doc20201125222904.pdf`

The stdout log of the webserver container shows the following:
`22:02:09 [Q] INFO Enqueued 1
22:02:09 [Q] INFO Process-1:1 processing [doc20201125222904.pdf]
Consuming doc20201125222904.pdf
Parser: RasterisedDocumentParser based on mime type application/pdf
Generating thumbnail for doc20201125222904.pdf...
Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -trim /usr/src/paperless/src/../consume/doc20201125222904.pdf[0] /tmp/paperless/paperless-3vtrk65p/convert.png
**** Error: Cannot find a 'startxref' anywhere in the file.
Output may be incorrect.
**** Error: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** However, the output may be incorrect.
**** Error: Trailer dictionary not found.
Output may be incorrect.

Requested FirstPage is greater than the number of pages in the file: 0
No pages will be processed (FirstPage > LastPage).
convert-im6.q16: no images defined `/tmp/paperless/paperless-3vtrk65p/convert.png' @ error/convert.c/ConvertImageCommand/3258.
Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!
**** Error: Cannot find a 'startxref' anywhere in the file.
Output may be incorrect.
**** Error: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** However, the output may be incorrect.
**** Error: Trailer dictionary not found.
Output may be incorrect.
No pages will be processed (FirstPage > LastPage).
Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -trim /tmp/paperless/paperless-3vtrk65p/gs_out.png /tmp/paperless/paperless-3vtrk65p/convert.png
convert-im6.q16: unable to open image `/tmp/paperless/paperless-3vtrk65p/gs_out.png': No such file or directory @ error/blob.c/OpenBlob/2874.
convert-im6.q16: no images defined `/tmp/paperless/paperless-3vtrk65p/convert.png' @ error/convert.c/ConvertImageCommand/3258.
Deleting directory /tmp/paperless/paperless-3vtrk65p
22:02:09 [Q] ERROR Failed [doc20201125222904.pdf] - Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/tmp/paperless/paperless-3vtrk65p/gs_out.png', '/tmp/paperless/paperless-3vtrk65p/convert.png'] : Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 49, in get_thumbnail
logging_group=self.logging_group)
File "/usr/src/paperless/src/documents/parsers.py", line 107, in run_convert
raise ParseError("Convert failed at {}".format(args))
documents.parsers.ParseError: Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/usr/src/paperless/src/../consume/doc20201125222904.pdf[0]', '/tmp/paperless/paperless-3vtrk65p/convert.png']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/src/paperless/src/documents/consumer.py", line 132, in try_consume_file
thumbnail = document_parser.get_optimised_thumbnail()
File "/usr/src/paperless/src/documents/parsers.py", line 168, in get_optimised_thumbnail
return self.optimise_thumbnail(self.get_thumbnail())
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 73, in get_thumbnail
logging_group=self.logging_group)
File "/usr/src/paperless/src/documents/parsers.py", line 107, in run_convert
raise ParseError("Convert failed at {}".format(args))
documents.parsers.ParseError: Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/tmp/paperless/paperless-3vtrk65p/gs_out.png', '/tmp/paperless/paperless-3vtrk65p/convert.png']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
res = f(*task["args"], **task["kwargs"])
File "/usr/src/paperless/src/documents/tasks.py", line 68, in consume_file
override_tag_ids=override_tag_ids)
File "/usr/src/paperless/src/documents/consumer.py", line 138, in try_consume_file
raise ConsumerError(e)
documents.consumer.ConsumerError: Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/tmp/paperless/paperless-3vtrk65p/gs_out.png', '/tmp/paperless/paperless-3vtrk65p/convert.png']

`

Is it possible to delay the consumption of new files while they're being written to avoid this?
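One way such a delay could work is to wait until the file has stopped changing before handing it to the consumer. The following is a sketch under that assumption; the function name and the default intervals are made up for illustration:

```python
import os
import time


def wait_until_stable(path: str, interval: float = 1.0, checks: int = 2,
                      timeout: float = 60.0) -> bool:
    """Return True once size and mtime stayed unchanged for `checks` polls."""
    deadline = time.monotonic() + timeout
    last = None
    stable = 0
    while time.monotonic() < deadline:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            return False  # file vanished, e.g. it was renamed or deleted
        current = (st.st_size, st.st_mtime_ns)
        if current == last:
            stable += 1
            if stable >= checks:
                return True
        else:
            # File changed since the last poll: start counting again.
            stable = 0
            last = current
        time.sleep(interval)
    return False  # still changing when the timeout expired
```

A consumer would call this before parsing and skip or retry the file when it returns False. It is a heuristic: a scanner that pauses longer than `interval * checks` while writing would still be caught mid-write.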

Find and delete duplicates

Hi (again),

One cool feature would be to compare document contents and identify duplicates, maybe upon document consumption and/or in a separate workflow.

Once again, great kudos for the work, keep that up!
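As a rough sketch of the simplest form of this idea, documents could be grouped by a hash of their normalized text content. This only finds documents whose text is identical after normalization, not near-duplicates; the function names are hypothetical:

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def content_fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(documents: Dict[int, str]) -> List[List[int]]:
    """Return groups of document ids that share the same fingerprint."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        groups[content_fingerprint(text)].append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

Because OCR output for two scans of the same page rarely matches exactly, a real implementation would probably need a fuzzier comparison than this.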

Document consumer won't see renamed files.

Some scanners first write a file under a temporary name such as "something.pd~", which the consumer does not pick up, and then rename it. The consumer never notices the renamed file.

Also: the consumer should expect that files may have been moved by the time it tries to consume them. Implement checks.

PAPERLESS_FORCE_SCRIPT_NAME does not work.

This is required to run Paperless on a sub path, e.g. localhost:8000/paperless/.

This is generally required if you wish to use a proxy server with many different services.

Right now, this does not work.

As for the front end, the src/documents/static/index.html contains a <base href="/">, which has to be adapted to href="/paperless" or similar.

This is easy since the index page is served as a django view and template context is available. See

class IndexView(TemplateView):

The front end further has some static assets for logos, which are always pointing towards <host>/assets/image.png. This is unsolved right now.

Furthermore, login/logout paths have to respect this path, as well as the admin/ link on the front end.

I don't need this, but if anyone cares about this, feel free to look into it.
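For anyone picking this up, the base href part could be derived from the configured script name roughly as follows. This is a sketch only; the helper name is hypothetical, and the value would still have to be passed into the template context of the index view:

```python
from typing import Optional


def base_href(script_name: Optional[str]) -> str:
    """Turn a sub path such as '/paperless' into a value for <base href>."""
    stripped = (script_name or "").strip("/")
    # A trailing slash matters: relative URLs resolve against the last
    # path segment, so '/paperless' without it would resolve against '/'.
    return "/{}/".format(stripped) if stripped else "/"
```

For example, base_href("/paperless") gives "/paperless/", and an unset script name gives "/".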

Traceback on classifier.predict_tags()

First of all: Thanks for the awesome work and all the effort you put into this!

I wanted to try the classifier with parts of my data and ran into a traceback on consumption

Traceback (most recent call last):
  File "/usr/src/paperless/src/documents/consumer.py", line 171, in try_consume_file
    document_consumption_finished.send(
  File "/usr/local/lib/python3.8/dist-packages/django/dispatch/dispatcher.py", line 177, in send
    return [
  File "/usr/local/lib/python3.8/dist-packages/django/dispatch/dispatcher.py", line 178, in <listcomp>
    (receiver, receiver(signal=self, sender=sender, **named))
  File "/usr/src/paperless/src/documents/signals/handlers.py", line 127, in set_tags
    matched_tags = matching.match_tags(document.content, classifier)
  File "/usr/src/paperless/src/documents/matching.py", line 36, in match_tags
    predicted_tag_ids = classifier.predict_tags(document_content)
  File "/usr/src/paperless/src/documents/classifier.py", line 224, in predict_tags
    tags_ids = self.tags_binarizer.inverse_transform(y)[0]
  File "/usr/local/lib/python3.8/dist-packages/sklearn/preprocessing/_label.py", line 1017, in inverse_transform
    if yt.shape[1] != len(self.classes_):
IndexError: tuple index out of range

I did also see a warning during the "create_classifier" run, which I bluntly ignored ofc. 😄

270 documents, 1 tag(s), 1 correspondent(s), 0 document type(s).
Vectorizing data...
Training tags classifier...
/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py:72: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)
Training correspondent classifier...
There are no document types. Not training document type classifier.

Tags from consumer sub directories

I made a small patch to allow me to set tags based on sub directories of the consumer directory (jayme-github/paperless#2) and I would like to know if you would accept that feature into your fork.
Looking at the feature-ocrmypdf branch you switched back to using INotify again, so my patch would also bring back the ability to run a recursive consumer.

Let me know what you think and I'll rebase against whatever branch makes sense.

Limit tesseract to 1 thread.

Each tesseract process launches multiple threads that do OCR on a single page in parallel. This marginally improves speed.

However, paperless already processes multiple pages in parallel (up to the CPU count), so the extra tesseract threads oversubscribe the CPU. This results in a massive performance decrease.

Fix by setting OMP_THREAD_LIMIT=1.
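A sketch of how the variable could be applied to child processes only, leaving the parent environment untouched. The helper name is an assumption, and the tesseract call in the usage note is illustrative:

```python
import os
from typing import Dict, Optional


def tesseract_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = "1"
    return env
```

The result would be passed to the OCR subprocess, for example `subprocess.run(["tesseract", "page.png", "out"], env=tesseract_env())`, so that page-level parallelism in paperless is the only source of concurrency.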

Trigger re-parsing or auto-assignment of consumed documents

Hi there,

first of all - I'm really impressed by the work you did on paperless. I've noticed the issue you opened on the original paperless project and then quickly decided to give it a go. The resource optimizations and feature enhancements you made are quite impressive. Wow!

One thing I'm kind of missing though is - I've now imported a bunch of PDF documents. It all went fine but I went on creating correspondents, document types and tags on-the-fly. Now I've got the most common information in there and I'd like to re-run auto assignment on my existing documents. I didn't (yet) find it in the awesome documentation. Can this somehow be accomplished?

Thank you so much for the great work you're doing with this project!

Cascading filters.

When filtering for a specific correspondent, subsequent filters for tags or types should only display options which are still available.
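The intended behaviour can be sketched on plain data: after a correspondent is chosen, only tags and types that occur on the remaining documents are offered. The Doc type and the function are hypothetical and ignore how the real API would query the database:

```python
from typing import Dict, FrozenSet, List, NamedTuple, Optional


class Doc(NamedTuple):
    correspondent: str
    doc_type: str
    tags: FrozenSet[str]


def remaining_options(docs: List[Doc], correspondent: Optional[str] = None) -> Dict[str, List[str]]:
    """Return the tags and types still selectable after the correspondent filter."""
    matching = [d for d in docs if correspondent is None or d.correspondent == correspondent]
    return {
        "tags": sorted({tag for d in matching for tag in d.tags}),
        "types": sorted({d.doc_type for d in matching}),
    }


DOCS = [
    Doc("Bank", "statement", frozenset({"finance"})),
    Doc("Bank", "letter", frozenset({"finance", "todo"})),
    Doc("Landlord", "invoice", frozenset({"housing"})),
]
```

With the sample data, selecting the correspondent "Bank" leaves the tags finance and todo and the types letter and statement; housing and invoice are no longer offered.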

Integration with OCRmyPDF

Add the ocr'ed text as a text layer to the scanned documents so that text can be copied from them.

Implement proper navigation between documents

When editing documents, the "Next" button should always lead to the next document in the view that was previously visible (i.e. search results, document list, or saved view) and should reload data as necessary.

  • There should also be a previous button

Log scrolling fails

With response: "Enter a valid date/time."

Request URL:

/api/logs/?page=1&page_size=25&ordering=-created&created__lt=2020-11-11T22:13:57.749728+01:00&level__gte=20
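One plausible cause, which the report does not confirm, is the unescaped "+" in the timestamp: in a query string, standard form decoding turns a literal "+" into a space, which leaves an invalid date/time. The sketch below shows the effect with Python's standard library and how percent-encoding avoids it:

```python
from urllib.parse import parse_qs, urlencode

TIMESTAMP = "2020-11-11T22:13:57.749728+01:00"

# Query string as it appears in the request above ("+" not escaped).
raw_query = "created__lt=" + TIMESTAMP + "&level__gte=20"
decoded = parse_qs(raw_query)["created__lt"][0]
# The "+" before the UTC offset has been decoded as a space.

# Properly encoded query string: "+" becomes %2B and survives decoding.
safe_query = urlencode({"created__lt": TIMESTAMP, "level__gte": 20})
round_trip = parse_qs(safe_query)["created__lt"][0]
```

If this is indeed the cause, the fix would belong on the client side, by encoding the parameter value before building the URL.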

Change saved filters

Maybe I understand the idea wrong but I would be kind of stuck once I saved a view. When I open it again, I cannot change the filters, the sort order or anything else. This is bad from my point of view for two reasons:

  1. You have to start from scratch in order to just adapt a small thing
  2. You cannot use a dashboard for further drilling down into your documents

Idea would be to maybe

  1. just use the normal controls for adding/changing filters and order
  2. add some form of save / save as option

Implement a sanity checker

That

  • checks if all originals are in place
  • checks if the checksums match
  • checks if there are unreferenced files
  • checks if all thumbnails are there
  • checks if the mime type of all documents matches

Basically a program that's executed in the background once in a while and tells you that everything is all right.
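A sketch of three of the checks listed above (missing originals, checksum mismatches, unreferenced files) on a plain directory. The function names, the use of SHA-256 and the mapping of relative paths to checksums are assumptions; the real checker would read this information from the database:

```python
import hashlib
import os
from typing import Dict, List


def sha256_of(path: str) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_archive(root: str, expected: Dict[str, str]) -> List[str]:
    """Compare files below `root` with `expected` (relative path -> checksum)."""
    problems = []
    for rel_path, checksum in expected.items():
        full_path = os.path.join(root, rel_path)
        if not os.path.isfile(full_path):
            problems.append("missing: " + rel_path)
        elif sha256_of(full_path) != checksum:
            problems.append("checksum mismatch: " + rel_path)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel_path = os.path.relpath(os.path.join(dirpath, name), root)
            if rel_path not in expected:
                problems.append("unreferenced: " + rel_path)
    return sorted(problems)
```

An empty result means everything is all right; a scheduled task could run this periodically and log any reported problems.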
