Fileganizer

Fileganizer is a tool that will

  • run a command to extract text from an input file,
  • parse the extracted text with grok-like patterns,
  • choose a pre-configured go-template depending on parsing results,
  • generate a result with go-template,
  • optionally run the result (as a command).
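
The parse, choose, and generate steps above can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not fileganizer's actual code: it uses named groups from the standard `regexp` package in place of grok patterns, skips the text-extraction step, and the `description`, `sampleDescriptions`, and `process` names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"text/template"
)

// description pairs a pattern with the template used when it matches.
// Hypothetical type for illustration; not fileganizer's own structure.
type description struct {
	name    string
	pattern *regexp.Regexp
	tmpl    string
}

// sampleDescriptions returns two made-up document types.
func sampleDescriptions() []description {
	return []description{
		{"invoice", regexp.MustCompile(`Invoice no\. (?P<id>\d+)`), "invoice-{{.id}}.pdf"},
		{"receipt", regexp.MustCompile(`Receipt (?P<ref>[A-Z]+-\d+)`), "receipt-{{.ref}}.pdf"},
	}
}

// process renders the template of the first description whose pattern
// matches text, exposing the named groups as template fields.
func process(text string, descs []description) (string, error) {
	for _, d := range descs {
		m := d.pattern.FindStringSubmatch(text)
		if m == nil {
			continue // pattern did not match, try the next description
		}
		fields := map[string]string{}
		for i, n := range d.pattern.SubexpNames() {
			if n != "" {
				fields[n] = m[i]
			}
		}
		t, err := template.New(d.name).Parse(d.tmpl)
		if err != nil {
			return "", err
		}
		var sb strings.Builder
		if err := t.Execute(&sb, fields); err != nil {
			return "", err
		}
		return sb.String(), nil
	}
	return "", fmt.Errorf("no description matched")
}

func main() {
	out, err := process("ACME Corp\nReceipt AB-42\nTotal: 10.00", sampleDescriptions())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Trying the descriptions in order and stopping at the first match keeps the selection rule easy to reason about; whether fileganizer uses the same rule is an assumption here.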

The typical use case is to run a pdftotext command to extract text from your invoices and similar documents, look for patterns such as IDs, dates, and names, and rename (move) the file using the parsed results.

Tutorial

Copy config.yaml.sample as config.yaml. Edit the file:

Leave ExtractTextCommand as is if you have pdftotext installed, or change it if you prefer another tool.

Leave env as is or declare other environment variables according to your needs. These environment variables will be available in your go-templates.

Leave commonTemplate empty. You will fill it later, according to your needs.

Leave months as is or translate the month names into your language. This setting is used to convert month names into numbers. For example, octobre (French for October) is converted to 10.
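
As an illustration of the month conversion, here is a minimal Go sketch. The `months` map and the `monthNumber` helper are hypothetical; the actual layout of the months setting in config.yaml may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// months maps French month names to their numbers.
// Hypothetical stand-in for the months setting in config.yaml.
var months = map[string]int{
	"janvier": 1, "février": 2, "mars": 3, "avril": 4,
	"mai": 5, "juin": 6, "juillet": 7, "août": 8,
	"septembre": 9, "octobre": 10, "novembre": 11, "décembre": 12,
}

// monthNumber looks up a month name, ignoring case and surrounding spaces.
// The boolean is false when the name is unknown.
func monthNumber(name string) (int, bool) {
	n, ok := months[strings.ToLower(strings.TrimSpace(name))]
	return n, ok
}

func main() {
	if n, ok := monthNumber("Octobre"); ok {
		fmt.Printf("%02d\n", n) // prints 10
	}
}
```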

Leave grokPatterns as is. You may add new patterns later, according to your needs.

Now we will work on fileDescriptions. Each entry contains the patterns to try on the input file and an output, written as a go-template, that we configure as a shell command.

  1. Run fileganizer -c config.yaml -f yourfile.pdf -t. This will print the output of the ExtractTextCommand.
  2. Identify some interesting patterns, for example a date or an identifier.
  3. Add these patterns with grok syntax (learn it from the Grok filter plugin for Logstash). Note that the parser is Grokky, which is not fully compatible with Grok.
  4. Write a go-template output using the available variables (.filename, .env.XXX for environment variables, .grok.xxx for parsed data).
  5. Run fileganizer -c config.yaml -f yourfile.pdf (without the -t option). This does the whole job and prints the generated result.
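
To make step 4 more concrete, the following Go sketch renders a template that uses .filename, .env.XXX, and .grok.xxx with the standard `text/template` package. The data values, the ARCHIVE variable, the grok field names, and the `render` helper are assumptions made for this example, and the way fileganizer builds its template data internally may differ.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// render executes tmplText against data and returns the result.
// missingkey=error makes a misspelled variable fail instead of
// silently rendering an empty value.
func render(tmplText string, data map[string]interface{}) (string, error) {
	t, err := template.New("out").Option("missingkey=error").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Illustrative values only; real values come from the input file,
	// the env section of the configuration, and the parsed patterns.
	data := map[string]interface{}{
		"filename": "scan.pdf",
		"env":      map[string]string{"ARCHIVE": "/archive"},
		"grok":     map[string]string{"year": "2023", "month": "10", "id": "42"},
	}
	tmpl := `mv "{{.filename}}" "{{.env.ARCHIVE}}/{{.grok.year}}-{{.grok.month}}-invoice-{{.grok.id}}.pdf"`
	out, err := render(tmpl, data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out) // prints mv "scan.pdf" "/archive/2023-10-invoice-42.pdf"
}
```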

You can iterate as many times as you need to improve the template. You can also add other fileDescriptions to identify other document types and print from other go-templates.

When you want to run the output as a shell command, add the -r option: fileganizer -c config.yaml -f yourfile.pdf -r.
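
For readers curious about what running the output as a shell command can look like in Go, here is a minimal sketch. It assumes the generated string is passed to `sh -c`; how fileganizer actually executes the result is not stated here, and `runShell` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runShell passes command to "sh -c" and returns its combined
// standard output and standard error.
func runShell(command string) (string, error) {
	out, err := exec.Command("sh", "-c", command).CombinedOutput()
	return string(out), err
}

func main() {
	// A harmless stand-in for a generated command such as mv "a.pdf" "b.pdf".
	out, err := runShell("echo renamed")
	if err != nil {
		fmt.Println("command failed:", err)
		return
	}
	fmt.Print(out)
}
```

Because parsed values end up inside a shell command, quoting them in the template (as in mv "..." "...") helps when file names contain spaces.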

Build

go build

Test

go test ./...

Run

Run fileganizer on a file and print the generated output:

./fileganizer -c <config.yaml> -f <file.pdf>

Run fileganizer on a file and run the generated output:

./fileganizer -c <config.yaml> -f <file.pdf> -r

Show the text contents of the PDF:

./fileganizer -c <config.yaml> -f <file.pdf> -t

Environment variables

| Name | Value |
|------|-------|
| LOG_TXT_FILENAME | File to log to, in plain text. Possible values: stdout, stderr, any filename. |
| LOG_JSON_FILENAME | File to log to, in JSON format. Possible values: stdout, stderr, any filename. |
| LOG_LEVEL | One of debug, info, warn, error, panic or fatal. Default is info. |

Note: if neither LOG_TXT_FILENAME nor LOG_JSON_FILENAME is set, logs are written to stdout in plain text format, the same as with LOG_TXT_FILENAME=stdout.
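
The defaulting rule described above can be sketched as follows. This is an illustration of the documented behavior, not fileganizer's logging code; the `logConfig` type and `resolveLogConfig` function are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// logConfig holds the resolved logging settings.
// Hypothetical type for illustration.
type logConfig struct {
	TextFile string // plain text destination, empty when disabled
	JSONFile string // JSON destination, empty when disabled
	Level    string
}

// resolveLogConfig reads the three variables through getenv and applies
// the documented defaults: plain text on stdout when no destination is
// set, and level info when LOG_LEVEL is empty.
func resolveLogConfig(getenv func(string) string) logConfig {
	cfg := logConfig{
		TextFile: getenv("LOG_TXT_FILENAME"),
		JSONFile: getenv("LOG_JSON_FILENAME"),
		Level:    getenv("LOG_LEVEL"),
	}
	if cfg.TextFile == "" && cfg.JSONFile == "" {
		cfg.TextFile = "stdout"
	}
	if cfg.Level == "" {
		cfg.Level = "info"
	}
	return cfg
}

func main() {
	cfg := resolveLogConfig(os.Getenv)
	fmt.Printf("%+v\n", cfg)
}
```

Taking the lookup function as a parameter makes the rule easy to check without touching the real environment.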

Licensing

This project is licensed under the MIT License. See the LICENSE file for the full license text.

Contributors

ymettier
