PDF-to-CSV

What does it do?

Converts PDFs into CSVs.

Moreover

I found the starting solution here:

https://stackoverflow.com/questions/58690461/how-to-convert-pdf-file-to-excel-file-using-python

Now I just want to add extra features to make it friendlier to use.

There is a plethora of solutions:

tabula-py (the one I'm using): https://github.com/chezou/tabula-py

Camelot: https://github.com/camelot-dev/camelot/tree/master (documentation: https://camelot-py.readthedocs.io/en/master/)

I'll be using existing libraries:

  • tabula-py: Extracts tables from PDF files
  • pandas: Data manipulation
  • Note: it's meant to be used on PDFs that contain tables; otherwise it might not work (or make sense to use).
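
As a rough illustration of how the two libraries fit together, here is a minimal sketch (not the project's actual code). It assumes tabula-py and a Java runtime are installed; the function name `pdf_to_csv` and the `<name>_table<N>.csv` output naming are made up for this example. `tabula.read_pdf` returns a list of pandas DataFrames, one per detected table, and each one is written out with `DataFrame.to_csv`.

```python
from pathlib import Path


def pdf_to_csv(pdf_path, out_dir=".", pages="all"):
    """Write every table found in a PDF to its own CSV file (hypothetical helper)."""
    source = Path(pdf_path)
    if not source.is_file():
        raise FileNotFoundError(f"input PDF not found: {pdf_path}")

    # Imported here so the check above still works on machines where
    # tabula-py (or the Java runtime it depends on) is not installed.
    import tabula

    # read_pdf returns a list of pandas DataFrames when multiple_tables=True.
    tables = tabula.read_pdf(str(source), pages=pages, multiple_tables=True)

    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    written = []
    for number, table in enumerate(tables, start=1):
        # Assumed naming scheme: <pdf name>_table<N>.csv
        csv_path = target / f"{source.stem}_table{number}.csv"
        table.to_csv(csv_path, index=False)
        written.append(csv_path)
    return written
```

If a single combined CSV is enough, `tabula.convert_into(pdf_path, csv_path, output_format="csv", pages="all")` does the conversion in one call, without going through pandas.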

How can I improve it?

  • Page selection: Choose which page(s) are turned into a CSV file.
  • Batch processing: Automate the conversion of multiple PDF files (by providing a directory as input).
  • Error handling: Provide informative messages for issues such as missing input files, invalid page numbers, or failed conversions.
  • Documentation: With examples and troubleshooting tips.
  • Customizable output: Allow users to customize the output according to their needs, such as:
    • delimiter: The character that separates individual fields. Common ones are commas (,), semicolons (;), tabs (\t), or pipes (|).
    • encoding: Different encodings support different sets of characters (for example UTF-8 or Latin-1), so the output encoding should be selectable.
    • Header exclusion: Leave the column names (header row) out of the output.
    • Column removal: Certain columns may not be desirable in the output.
    • Custom header names: Rename the columns to names that make more sense.
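
Some of the ideas above could look roughly like the sketch below. None of this exists in the project yet: `parse_pages`, `find_pdfs`, `to_custom_csv` and all their parameters are hypothetical names chosen for illustration. Only the pandas calls (`drop`, `rename`, `to_csv`) are real library API.

```python
from pathlib import Path

import pandas as pd


def parse_pages(spec):
    """Turn a page spec such as "1,3-5" into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        try:
            first = int(start)
            last = int(end) if end else first
        except ValueError:
            raise ValueError(f"invalid page number in {part!r}") from None
        if first < 1 or first > last:
            raise ValueError(f"invalid page range {part!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)


def find_pdfs(directory):
    """Batch processing: list the PDF files directly inside a directory."""
    folder = Path(directory)
    if not folder.is_dir():
        raise NotADirectoryError(f"not a directory: {directory}")
    return sorted(p for p in folder.iterdir() if p.suffix.lower() == ".pdf")


def to_custom_csv(df, path=None, delimiter=",", encoding="utf-8",
                  include_header=True, drop_columns=(), rename=None):
    """Write a DataFrame as CSV with the customizations listed above.

    drop_columns uses the original column names; rename maps old names to new.
    With path=None, pandas returns the CSV text instead of writing a file.
    """
    out = df.drop(columns=list(drop_columns)).rename(columns=rename or {})
    return out.to_csv(path, sep=delimiter, encoding=encoding,
                      header=include_header, index=False)
```

For example, `to_custom_csv(df, delimiter=";", drop_columns=["Balance"], rename={"Desc": "Description"})` would return semicolon-separated text without the `Balance` column and with `Desc` renamed. The list from `parse_pages` can be passed straight to the `pages` argument of `tabula.read_pdf`, which accepts a list of page numbers.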

Why did I create this project?

I wanted to manage my spending and plan my financial life as a responsible individual. My bank lets me retrieve PDFs of previous months (indefinitely), so I wanted to check how much money I spent and on what. That requires labeling each transaction manually, but that's manageable. What wasn't manageable was copying all the transactions... That's when I looked for libraries to solve this. I FOUND tabula-py!!! Now I just want to make it easy to use.
