openpaperwork / paperwork
Personal document manager (Linux/Windows) -- Moved to Gnome's Gitlab
Home Page: https://gitlab.gnome.org/World/OpenPaperwork/paperwork
When a document is created, the (internal) document lists and the (GUI) list of matching documents should be updated
Add an image import function. --> one image == one page ; one directory == one document.
Paperwork should only look for a scanner when the user tries to scan something.
Once #20 is implemented, support for inotify would be nice. This way, if the user or another program modifies a file, the user won't have to restart Paperwork.
At startup, Paperwork reindexes all the documents. It could be a good idea to store keywords in a sqlite3 database. This way, at startup, Paperwork would only have to check the modification timestamps of the files.
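A minimal sketch of the timestamp check described above, using Python's standard sqlite3 module. The table layout and the function names (open_index, find_stale_files, store_keywords) are assumptions made for illustration, not Paperwork's actual code.

```python
import os
import sqlite3


def open_index(db_path):
    """Open (or create) the keyword index database (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " path TEXT PRIMARY KEY,"
        " mtime REAL NOT NULL,"
        " keywords TEXT NOT NULL)"
    )
    return conn


def find_stale_files(conn, paths):
    """Return the files that are unknown or whose mtime has changed."""
    stale = []
    for path in paths:
        row = conn.execute(
            "SELECT mtime FROM pages WHERE path = ?", (path,)
        ).fetchone()
        if row is None or row[0] != os.path.getmtime(path):
            stale.append(path)
    return stale


def store_keywords(conn, path, keywords):
    """Record the keywords of a file along with its current mtime."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (path, mtime, keywords) VALUES (?, ?, ?)",
        (path, os.path.getmtime(path), " ".join(sorted(keywords))),
    )
    conn.commit()
```

At startup, only the files returned by find_stale_files would need to be read and indexed again; everything else could be loaded straight from the database.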
Calibration: When scanning, a busy mouse cursor should be displayed
When I run Paperwork, some error messages appear on the console. It seems that Paperwork tries to open a nonexistent document, 20120104_2143_30, whose name corresponds to the current date.
$ paperwork
No handlers could be found for logger "pycountry.db"
Looking for locales in 'locale/fr/LC_MESSAGES/paperwork.mo' ...
Will use locales from 'locale'
Config file found: /home/chris/.paperwork.conf
Try to used UI file ./mainwindow.glade but failed: L'ouverture du fichier « ./mainwindow.glade » a échoué : Aucun fichier ou dossier de ce type
UI file used: src/mainwindow.glade
Main window resized
Exception while trying to get the number of pages of '20120104_2143_30': [Errno 2] Aucun fichier ou dossier de ce type: '/home/chris/papers/20120104_2143_30'
Showing first page of the doc
Showing page '20120104_2143_30 p1'
Unable to show image for '20120104_2143_30 p1': [Errno 2] Aucun fichier ou dossier de ce type: '/home/chris/papers/20120104_2143_30/paper.1.jpg'
Unable to read [/home/chris/papers/20120104_2143_30/paper.1.txt]: [Errno 2] Aucun fichier ou dossier de ce type: '/home/chris/papers/20120104_2143_30/paper.1.txt'
When there is no scanner available, the button "scan next page" in the tabs is still sensitive.
Someone suggested that being able to import PDFs could be useful.
The idea is the following: instead of focusing only on paper documents, Paperwork should be able to deal with any kind of bill, letter, or similar document. Everybody gets some of these by email, in PDF format. This way, instead of having to do two searches (one with something like Beagle, and one with Paperwork), the user could simply do one single search in Paperwork.
--> Is it really useful?
--> Maybe it could be done as a plugin? (which implies that a plugin mechanism must be implemented)
When Paperwork starts, it looks for a scanner, and disables the "scan" buttons if none is found. In that case, it would be nice to display a popup with a "retry" button.
Add an option in the advanced menu to reindex documents + labels.
When a page is scanned, 2 calls to tesseract are made (to figure out the orientation of the page). On multi-CPU computers, these calls could be done in parallel.
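A rough sketch of how the two OCR passes could run side by side, using concurrent.futures from the standard library. The ocr and score callables, and the list of candidate angles, are hypothetical placeholders supplied by the caller; they are not Paperwork or Tesseract APIs. Threads are assumed to be enough here because each pass would mostly wait on an external tesseract process.

```python
from concurrent.futures import ThreadPoolExecutor


def best_orientation(image, ocr, score, angles):
    """Run one OCR pass per candidate angle in parallel; keep the best one.

    `ocr(image, angle)` returns the recognized text and `score(text)` rates
    it. Both are hypothetical callables provided by the caller.
    """
    with ThreadPoolExecutor(max_workers=len(angles)) as pool:
        # pool.map preserves the order of `angles` in its results.
        texts = list(pool.map(lambda angle: ocr(image, angle), angles))
    scored = [(score(text), angle, text) for angle, text in zip(angles, texts)]
    _, angle, text = max(scored, key=lambda item: item[0])
    return angle, text
```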
The 3 tabs are not really ergonomic.
A nicer layout would be:
Using gnome vfs could make things easier on some points: papers could be stored on anything accessible through GIO (samba, ftp, ssh, etc).
Also, it could make the use of the Trash folder easier (see issue 19).
When searching, it would sometimes be useful to reject documents containing specific keywords.
For instance "accident" yields all my earnings statements when I'm actually looking for documents related to my last bike crash.
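One possible shape for this, as a sketch only: a keyword prefixed with "-" is rejected. The "-" syntax, the function names and the in-memory docs mapping are all assumptions made for the example.

```python
def parse_query(query):
    """Split a query into required and rejected keywords.

    A leading "-" marks a keyword to reject (assumed syntax).
    """
    required, rejected = set(), set()
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            rejected.add(token[1:])
        else:
            required.add(token)
    return required, rejected


def search(docs, query):
    """Return the sorted ids of the documents matching the query.

    `docs` maps a document id to the set of lowercase keywords found in it.
    """
    required, rejected = parse_query(query)
    return sorted(
        doc_id
        for doc_id, keywords in docs.items()
        if required <= keywords and not (rejected & keywords)
    )
```

With this, a query such as "accident -salary" would keep the bike crash report and drop the earnings statements.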
Indexing, scanning, etc. should be done in separate threads. Using GObject, synchronization with the GTK thread could be easily handled.
In the multi-scan dialog, one piece of feedback is missing: the number of pages already scanned.
Tesseract is nice, but Cuneiform appears to be pretty good as well (the output is much cleaner). Having Cuneiform support would be nice.
However, it raises a question: must that appear in the GUI? Average users probably don't care which system they use as long as it works fine.
I think it would be better if Paperwork autodetected the OCR systems and used a preference list (for instance: 1) Cuneiform if available, 2) Tesseract if available, 3) complain to the user). It would keep the GUI simpler.
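The preference list could be sketched like this, assuming the tools are installed as executables named cuneiform and tesseract on the PATH. The detect_ocr function and its parameters are hypothetical; shutil.which is the standard-library lookup.

```python
import shutil

# Preference order taken from the suggestion above.
OCR_PREFERENCES = ("cuneiform", "tesseract")


def detect_ocr(preferences=OCR_PREFERENCES, which=shutil.which):
    """Return the name of the first OCR tool found on the PATH, or None.

    `which` can be replaced in tests; by default it is shutil.which.
    """
    for name in preferences:
        if which(name) is not None:
            return name
    return None
```

A None result would be the point where Paperwork complains to the user.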
The search bar should be in the main window. Results should take the place of the pages list.
In the settings window, the language names are always in English. Translating at least the most common ones would be nice.
(beware of UTF-8 issues)
It would be better if the scan progress could be reported to the user. However, it seems that this will require sending a patch to whoever is responsible for python-imaging-sane.
The basic idea is the following: when you look for documents, you usually think first of the keywords in the title(s) of the target document(s). Usually, these words are written in a bigger font than the others.
--> When documents are found, it would be nice to give them scores based on the size of the font used to write these keywords. The search results could then be ordered using this score.
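A very rough sketch of such a score, assuming the OCR step provides a bounding box for each word. The height of the box is used as a stand-in for the font size, and it is compared to the median word height of the page. The function name and the (word, height) input format are assumptions.

```python
from statistics import median


def font_size_score(word_boxes, keywords):
    """Score a page by how large the matching keywords are printed.

    `word_boxes` is a list of (word, box_height) pairs. Each keyword adds
    the ratio between its tallest occurrence and the page's median height.
    """
    if not word_boxes:
        return 0.0
    typical = median(height for _, height in word_boxes)
    if typical <= 0:
        return 0.0
    score = 0.0
    for keyword in {k.lower() for k in keywords}:
        heights = [h for word, h in word_boxes if word.lower() == keyword]
        if heights:
            score += max(heights) / typical
    return score
```

Search results could then be sorted by this score in descending order, so that a document with the keyword in its title comes before one where it only appears in the body text.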
"Delete" options are tab-specific. There should be a button at the bottom of each tab (like the "add label" button for instance).
Labels are currently only deleted when no document uses them anymore.
It would be nicer if a right-click menu allowed removing one of them from all documents, or editing them (like changing their name or their color).
Currently, only page-by-page scanning is supported. Support for scanner feeders should be added as well (see simple-scan for instance).
It could be useful to actually zip each document:
When (re)indexing the documents, the progress bar remains at 100% much longer than it should.
Keywords are now highlighted in the document. It would be even better if search keywords were highlighted as well.
In the multi-scan dialog, it would be nice to also be able to add pages to the current document.
Currently Paperwork suggestions are based on each keyword individually. But when we are searching with many keywords, suggestions made due to one keyword can be incompatible with the other one(s).
--> It would be nice if it only suggested useful corrections.
(based on the Gnome recommendations)
A 'cancel' option could be quite useful for some operations. Mostly those regarding deletion of pages or documents (using the recycling bin could be a good thing too).
There are a lot of Tesseract languages available. Instead of using a predefined list, it should be generated based on the trained data files available.
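As a sketch, the list could be built by looking for .traineddata files, the extension used by Tesseract 3 for its language data. The location of the tessdata directory varies between installations, so it is left as a parameter here; the function name is hypothetical.

```python
import os


def available_languages(tessdata_dir):
    """List the language codes that have a trained data file.

    `tessdata_dir` is the directory holding the *.traineddata files; its
    location depends on how Tesseract was installed.
    """
    suffix = ".traineddata"
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )
```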
When the user interrupts the scan of the first page of a document (by pressing a button on the scanner itself or Ctrl-C in the terminal), Paperwork leaves an empty directory.
Currently it's only possible to remove a whole document. Removing a single page can prove useful in case of bad scans.
(page scanned twice, or with a poor orientation, etc)
In the calibration frame, the mouse cursor should change depending on whether it is over the preview or over a grip.
When scanning multiple documents, it would be handy to be able to specify which labels must be put on the documents before even scanning them.
For instance, someone starting with Paperwork may want to scan all their earning statements in one shot and have the same label put on all of them.
Paperwork should display automatically the "page" tab when:
Labels with accents in their name cannot be selected or edited. To fix.
It would be better to only keep the useful part of each scan (in other words, the A4 sheet only, and not what's around it).
To do so, a scanner calibration window should be added to the settings.
Currently, if Tesseract is not installed, Paperwork starts as usual. It would be nice to have a popup warning about this problem. Scanning and OCR should also be disabled.
==> if not sane or not tesseract --> popup + disable scan/ocr
If you do a calibration scan, it is not possible anymore to do a normal scan. If you run Paperwork in a terminal, an exception is raised with the error message "Device is busy".
There are 2 possibilities:
Probably due to the conversion / resize process they go through.
i18n/l10n to do
When deleting a document, the search result list is not updated.
When deleting a page, there seem to be too many refreshes.
Also, the busy cursor doesn't appear half of the time.
Also, a label seems to be missing beside the progress bar.
Being able to tag documents would be a nice feature. For instance tags like "bank", "salary", "bike crash", etc ... :)
Having colors on these tags would also be awesome.
It would be nice to be able to export documents as PDF (for emails for instance).
http://tfischernet.wordpress.com/2008/11/26/searchable-pdfs-with-linux/
http://blog.konradvoelkel.de/2010/01/linux-ocr-and-pdf-problem-solved/
The only issue here would be the quality of the resulting PDF (no display issues in some readers? no weird behavior with Ctrl+F?)
When you display a new page, the scroll bar remains where it was on the previous page. Most of the time, the user wants to start reading from the beginning of the page. So the scroll bar should go back up when the user changes the current page.
Since v3, Tesseract is able to generate hOCR files. Since v3.01, it ships with a configuration file for that (Tesseract Issue 377).
Using this format would avoid having to assemble the .txt and .box files each time the user wants to see a page.
Document ids are basically the date and time at which the documents were scanned. Instead of displaying them as-is (YYYYMMDD_HHmm_SS), they could be displayed in a nicer and localized way: something like "Thu. 21st September 2011" for instance.
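A small sketch of the conversion, using only the standard datetime module. The function names and the chosen output format are assumptions. The day and month names produced by strftime follow the process locale, so they would only be localized if the application has called locale.setlocale beforehand.

```python
from datetime import datetime

# Matches ids such as 20120104_2143_30 (YYYYMMDD_HHmm_SS).
DOC_ID_FORMAT = "%Y%m%d_%H%M_%S"


def parse_doc_id(doc_id):
    """Turn a document id into a datetime; raises ValueError if malformed."""
    return datetime.strptime(doc_id, DOC_ID_FORMAT)


def pretty_doc_id(doc_id):
    """Return a human-readable date for the id, or the id itself on failure."""
    try:
        when = parse_doc_id(doc_id)
    except ValueError:
        # Not a date-based id: show it unchanged.
        return doc_id
    # %A and %B are rendered according to the current locale.
    return when.strftime("%A %d %B %Y")
```

Falling back to the raw id keeps the list usable for documents whose directory name does not follow the date pattern.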