The invoicepdf2data from senthilsweb



Data extractor for PDF invoices

Extracts text from PDF files using different techniques, such as pdftotext, pdfminer, or OCR (tesseract, tesseract4, or gvision, i.e. Google Cloud Vision).
Searches the extracted text with regular expressions defined in a YAML-based template system.
Saves the results as CSV, JSON, or XML, or renames the PDF files to match their content.
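
The template step above can be illustrated with a minimal, standard-library-only sketch. This is not the project's or the invoice2data library's actual implementation: the `TEMPLATE` dictionary, the field patterns, the `extract_fields` helper, and the sample text are all hypothetical, and a plain Python dict stands in for what would be a YAML template file.

```python
import re

# Hypothetical template. In the real project this information lives in a YAML file;
# a plain dict is used here so the sketch needs no third-party packages.
TEMPLATE = {
    "issuer": "Example Hosting",
    "keywords": ["Example Hosting"],  # all keywords must appear for the template to apply
    "fields": {
        "invoice_number": r"Invoice No\.?\s*(\d+)",
        "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
        "amount": r"Total:\s*(\d+\.\d{2})",
    },
}


def extract_fields(text, template):
    """Return a dict of captured fields, or None if the template does not apply."""
    if not all(keyword in text for keyword in template["keywords"]):
        return None
    result = {"issuer": template["issuer"]}
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text)
        if match is None:
            return None  # treat every field as required in this sketch
        result[name] = match.group(1)
    if "amount" in result:
        result["amount"] = float(result["amount"])
    return result


# Made-up text standing in for the output of the PDF-to-text step.
sample_text = """Example Hosting GmbH
Invoice No. 30064443
Date: 2014-05-07
Total: 34.73
"""

extracted = extract_fields(sample_text, TEMPLATE)
print(extracted)
```

Keeping the patterns in data rather than in code is what lets a new invoice layout be supported by adding a template instead of changing the extractor.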

INSTALLATION OF VIRTUAL ENVIRONMENT AND FLASK:

Commands to install the virtual environment and the required packages:

sudo apt-get install python3-pip
sudo pip3 install virtualenv
virtualenv venv
source venv/bin/activate
pip install Flask
pip install pymongo
pip install invoice2data
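
After the commands above, one way to confirm that the packages are visible inside the activated virtual environment is a short check like the following. It is only a sketch: the `check_modules` helper is hypothetical, and the module names are the import names assumed for the three packages installed with pip.

```python
import importlib.util

# Import names assumed for the packages installed above (Flask, pymongo, invoice2data).
REQUIRED_MODULES = ["flask", "pymongo", "invoice2data"]


def check_modules(names):
    """Map each module name to True if it can be found in the current environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_modules(REQUIRED_MODULES)
for name, found in status.items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

Run it with the `python` from the activated environment; a module reported as MISSING usually means the package was installed outside the virtual environment.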


AFTER INSTALLATION, GO TO THE PROJECT DIRECTORY:

e.g.:
(venv) taher@ubuntu:~/projects/invoice_reader_ai$


RUN THE FOLLOWING COMMANDS FROM THE PROJECT DIRECTORY:
export FLASK_APP=pdfinvoice2data.py
flask run

OPEN THE PROJECT IN YOUR BROWSER:
http://127.0.0.1:5000/


Go from PDF files to this:

{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting', 'lines': [{'price': 42.0, 'desc': u'Small Business StandardExchange 2010\nGrundgeb\xfchr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n', 'pos': u'7', 'qty': 1.0}]}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}
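
Records like the ones above hold the date as a (year, month, day) tuple, so they need a small conversion before being written as JSON or CSV. The sketch below shows one possible way to do that with the standard library only; it is not the project's own export code, and the `normalize` helper and the chosen column order are assumptions made for illustration.

```python
import csv
import io
import json
from datetime import date

# Two of the records shown above, copied as plain Python dicts.
records = [
    {"date": (2014, 8, 3), "invoice_number": "42183017", "amount": 4.11,
     "desc": "Invoice 42183017 from Amazon Web Services"},
    {"date": (2015, 1, 28), "invoice_number": "12429647", "amount": 101.0,
     "desc": "Invoice 12429647 from Envato"},
]


def normalize(record):
    """Copy a record, turning the (year, month, day) tuple into an ISO date string."""
    row = dict(record)
    row["date"] = date(*record["date"]).isoformat()
    return row


rows = [normalize(r) for r in records]

# JSON output: one list of objects.
json_text = json.dumps(rows, indent=2)

# CSV output: written to an in-memory buffer; a file opened with newline="" works the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["date", "invoice_number", "amount", "desc"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(json_text)
print(csv_text)
```

Records that carry a nested `lines` list, like the first one above, would need to be flattened or dropped before the CSV step, since a CSV row has no natural place for nested data.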
