Git Product home page Git Product logo

parsee-core's Introduction

Parsee AI

Parsee AI is an opinionated, high-level, open source data extraction and structuring framework created by SimFin. Our goal is to make the structuring of data from the most common sources of unstructured data (mainly PDFs, HTML files and images) as easy as possible.

Parsee can be used entirely from your local Python environment. We also provide a hosted version of Parsee, where you can edit and share your extraction templates and also run jobs in the cloud: app.parsee.ai.

Parsee is specialized for the extraction of data from a financial domain (tables, numbers etc.) but can be used for other use-cases as well.

While Parsee has first class support for LLMs, the goal of Parsee is also not to limit itself to LLMs, as at least as of today, custom non-LLM models usually outperform LLMs in terms of speed, accuracy and cost-efficiency given that there is a large enough dataset available. At SimFin for example, we are using Parsee for extracting data from hundreds of documents daily, entirely without LLMs as we have already a substantial dataset built up for our tasks.

For handling tables from PDFs and images properly, we also released our open source PDF table extraction package: https://github.com/parsee-ai/parsee-pdf-reader

Installation:

Recommended install with poetry: https://python-poetry.org/docs/

poetry add parsee-core

Alternatively:

pip install parsee-core

Quick Example

Goal:

Given we have some invoices, we want to:

  1. extract the invoice total, but not just the number, also the currency attached to it.
  2. extract the issuer of the invoice

Imports

import os
from parsee.templates.helpers import StructuringItem, MetaItem, create_template
from parsee.extraction.models.helpers import *
from parsee.converters.main import load_document, from_text
from parsee.extraction.run import run_job_with_single_model
from parsee.utils.enums import *

Step 1: create an extraction template

question_to_be_answered = "What is the invoice total?"
output_type = OutputType.NUMERIC

meta_currency_question = "What is the currency?"
meta_currency_output_type = OutputType.LIST # we want the model to use a pre-defined item from a list, this is basically a classification
meta_currency_list_values = ["USD", "EUR", "Other"] # any list of strings can be used here

meta_item = MetaItem(meta_currency_question, meta_currency_output_type, list_values=meta_currency_list_values)

invoice_total = StructuringItem(question_to_be_answered, output_type, meta_info=[meta_item])

let's also define an item for the issuer of the invoice

invoice_issuer = StructuringItem("Who is the issuer of the invoice?", OutputType.ENTITY)

job_template = create_template([invoice_total, invoice_issuer])

As an alternative to using extraction templates, you can also run your own custom prompts, more in tutorial 10.

Step 2: define a model

In the following we will use the Mixtral model from Replicate, requires an API key: https://replicate.com/

replicate_api_key = os.getenv("REPLICATE_KEY")
replicate_model = replicate_config(replicate_api_key, "mistralai/mixtral-8x7b-instruct-v0.1")

If you intend to use a model that has multimodal capabilities such as GPT 4 or Claude 3, you can enable the multimodal queries by setting the multimodal setting to True (you can also specify how many images should be passed to the modal at most and the maximum size for each image):

gpt_model = gpt_config(os.getenv("OPENAI_KEY"), None, openai_model_name="gpt-4-turbo", multimodal=True, max_images=3, max_image_size=2000)

or for Anthropic models:

anthropic_model = anthropic_config(os.getenv("ANTHROPIC_KEY"), "claude-3-opus-20240229", None, multimodal=True, max_images=1, max_image_size=800)

of course you can also load a locally hosted model with Ollama:

ollama_model = ollama_config("mistral")

Step 3: load a document

Parsee converts all data (strings, file contents etc.) to a standardized format, the class for this is called StandardDocumentFormat.

a) Let's first create a StandardDocumentFormat object from a simple string

input_string = "The invoice total amounts to 12,5 Euros and is due on Feb 28th 2024. Invoice to: Some company LLC. Thanks for using the services of CloudCompany Inc."
document = from_text(input_string)

b) We can also simply load and convert files into the StandardDocumentFormat with the help of the converters that are included in Parsee, let's use an actual PDF invoice now

file_path = "../tests/fixtures/Midjourney_Invoice-DBD682ED-0005.pdf"
document = load_document(file_path)

Step 4: run the extraction

_, _, answers_open_source_model = run_job_with_single_model(document, job_template, replicate_model)

If we look at the answers of the model we get the following:

answers_open_source_model[0].class_value
>> '11.9'
answers_open_source_model[0].meta[0].class_value
>> 'USD'

We can also use a different model to run the same extraction:

# enter your key manually here or load from an .env file
open_ai_api_key = os.getenv("OPENAI_KEY")
gpt_model = gpt_config(open_ai_api_key)

_, _, answers_gpt = run_job_with_single_model(document, job_template, gpt_model)

Full Tutorials

  1. Extraction Templates & Basics: Python Code.

  2. Meta Items: Python Code.

  3. Table Extraction: Python Code.

  4. Loading Templates from Parsee Cloud: Python Code.

  5. Saving Templates: Python Code.

  6. Datasets: Python Code.

  7. Model Evaluations: Python Code.

  8. Langchain Integration: Python Code.

  9. Multimodal Models: Python Code.

  10. Using Parsee Cloud to Load Images for Multimodal Applications: Python Code.

  11. Custom Prompts: Python Code.

Basic Rules for Extraction Templates

In the following, we will only focus on the "general questions" items of the extraction templates. The logic for table detection/structuring items is quite similar and we will add some more explanations for them in the future.

Base Logic

In the most basic sense, every question you define under the "general questions" category can have exactly one answer.

If no answer can be found for a question (or meta item), the answer can always be „n/a“, meaning that the parsing of the values was not successful or the model did not have an answer. In that sense, all outputs are "nullable" but will be represented by the string value "n/a" in case they are null.

A question can have more than one answer only when there is a meta item defined, which will create an „axis“ along which the model can give different answers to the same question.

Example

For the question: What is the invoice total? (output type numeric)

If there is no meta item defined, the model can answer this question only with one number (because of the numeric output type) or with "n/a".

e.g.

  • Invoice total: 10.0

OR

  • Invoice total: 21.5

OR

  • Invoice total: n/a

If for some reason you want the model to not just respond with one answer, but in case there are maybe several different answers to a question for a single document, you can add a meta item.

If we define a meta item „invoice date“ and attach it to the invoice total, the model can now theoretically give several answers for the same document, differentiated by their meta ID:

e.g.

  • (first answer) Invoice total: 10.0 as per 2022-03-01

AND

  • (second answer) Invoice total: 24.0 as per 2022-06-01

So you can imagine the meta items as a sort of „key“, in the sense that as long as the meta values differ for 2 items, their keys will be different. All output values can be imagined as key value pairs.

parsee-core's People

Contributors

thf24 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.