
open-parse's Introduction


Easily chunk complex documents the same way a human would.

Chunking documents is a challenging task that underpins any RAG system. High-quality results are critical to a successful AI application, yet most open-source libraries are limited in their ability to handle complex documents.

Open Parse is designed to fill this gap by providing a flexible, easy-to-use library capable of visually discerning document layouts and chunking them effectively.

How is this different from other layout parsers?

✂️ Text Splitting

Text splitting converts a file to raw text and slices it up.

  • You lose the ability to easily overlay the chunk on the original PDF.
  • You ignore the underlying semantic structure of the file: headings, sections, and bullets all carry valuable information.
  • No support for tables, images or markdown.
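
To make these limitations concrete, here is a minimal sketch of naive fixed-size text splitting. The function name `naive_split` and the sample text are illustrative assumptions and are not part of Open Parse.

```python
from typing import List


def naive_split(text: str, chunk_size: int) -> List[str]:
    """Slice raw text into fixed-size chunks, ignoring all document structure."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical sample: two headings, each followed by one sentence.
sample = "# Safety\nTurn off the gas valve.\n# Setup\nLevel the home first."
chunks = naive_split(sample, 24)
for chunk in chunks:
    print(repr(chunk))
```

The chunk boundaries land wherever the character count dictates, regardless of headings or sentence ends, and nothing ties a chunk back to its position on the page.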

🤖 ML Layout Parsers

There are some fantastic libraries like layout-parser.

  • While they can identify various elements like text blocks, images, and tables, they are not built to group related content effectively.
  • They strictly focus on layout parsing - you will need to add another model to extract markdown from the images, parse tables, group nodes, etc.
  • We've found performance to be sub-optimal on many documents while also being computationally heavy.

💼 Commercial Solutions

  • Typically priced at ≈ $10 / 1k pages.
  • Requires sharing your data with a vendor

Highlights

  • 🔍 Visually-Driven: Open-Parse visually analyzes documents for superior LLM input, going beyond naive text splitting.

  • ✍️ Markdown Support: Basic markdown support for parsing headings, bold and italics.

  • 📊 High-Precision Table Support: Extract tables into clean Markdown formats with accuracy that surpasses traditional tools.

    Examples The following examples were parsed with unitable.



  • 🛠️ Extensible: Easily implement your own post-processing steps.

  • 💡 Intuitive: Great editor support. Completion everywhere. Less time debugging.

  • 🎯 Easy: Designed to be easy to use and learn. Less time reading docs.
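
As a rough illustration of the extensibility point above, a post-processing step can be thought of as a function that takes a list of nodes and returns a new list. The sketch below is library-independent: `SketchNode`, `drop_short_nodes`, `strip_whitespace`, and `run_pipeline` are hypothetical names and do not reflect Open Parse's actual interface, which is described in the project documentation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SketchNode:
    """Hypothetical stand-in for a parsed chunk; not Open Parse's node type."""
    text: str


# A post-processing step maps a list of nodes to a new list of nodes.
Step = Callable[[List[SketchNode]], List[SketchNode]]


def strip_whitespace(nodes: List[SketchNode]) -> List[SketchNode]:
    """Trim leading and trailing whitespace from every node."""
    return [SketchNode(n.text.strip()) for n in nodes]


def drop_short_nodes(min_chars: int) -> Step:
    """Build a step that removes nodes shorter than min_chars characters."""
    def step(nodes: List[SketchNode]) -> List[SketchNode]:
        return [n for n in nodes if len(n.text) >= min_chars]
    return step


def run_pipeline(nodes: List[SketchNode], steps: List[Step]) -> List[SketchNode]:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        nodes = step(nodes)
    return nodes


nodes = [
    SketchNode("  Installation  "),
    SketchNode("3"),  # e.g. a stray page number
    SketchNode("Level the home before connecting utilities."),
]
result = run_pipeline(nodes, [strip_whitespace, drop_short_nodes(5)])
print([n.text for n in result])
```

Keeping each step a plain list-in, list-out function makes steps easy to reorder and to test in isolation.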


Example

Basic Example

import openparse

basic_doc_path = "./sample-docs/mobile-home-manual.pdf"
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)

for node in parsed_basic_doc.nodes:
    print(node)

📓 Try the sample notebook here

Semantic Processing Example

Chunking documents is fundamentally about grouping similar semantic nodes together. By embedding the text of each node, we can then cluster them together based on their similarity.

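import os

from openparse import processing, DocumentParser

# Assumption: the OpenAI API key is stored in the OPENAI_API_KEY environment variable.
OPEN_AI_KEY = os.environ["OPENAI_API_KEY"]
basic_doc_path = "./sample-docs/mobile-home-manual.pdf"

# Embed each node and group similar nodes, keeping chunks between 64 and 1024 tokens.
semantic_pipeline = processing.SemanticIngestionPipeline(
    openai_api_key=OPEN_AI_KEY,
    model="text-embedding-3-large",
    min_tokens=64,
    max_tokens=1024,
)
parser = DocumentParser(
    processing_pipeline=semantic_pipeline,
)
parsed_content = parser.parse(basic_doc_path)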

📓 Sample notebook here
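
The idea of grouping nodes by embedding similarity can be sketched without any external service. The example below is an assumption-laden illustration, not the logic of `SemanticIngestionPipeline`: it uses hand-written toy vectors instead of real embeddings, compares only adjacent nodes, and ignores token limits. The names `cosine_similarity` and `group_adjacent` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_adjacent(embeddings: List[Sequence[float]], threshold: float) -> List[List[int]]:
    """Group consecutive node indices whose neighbouring embeddings are similar."""
    groups: List[List[int]] = []
    for i, emb in enumerate(embeddings):
        if groups and cosine_similarity(embeddings[i - 1], emb) >= threshold:
            groups[-1].append(i)  # similar to the previous node: same chunk
        else:
            groups.append([i])  # dissimilar (or first node): start a new chunk
    return groups


# Toy 2-D "embeddings": nodes 0-1 point one way, nodes 2-3 another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = group_adjacent(toy_embeddings, threshold=0.8)
print(groups)  # [[0, 1], [2, 3]]
```

Comparing only neighbours preserves reading order, which matters when chunks are later overlaid on the original pages.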

Requirements

Python 3.8+

Dealing with PDFs:

Extracting Tables:

  • PyMuPDF has some table detection functionality. Please see their license.
  • Table Transformer is a deep learning approach.
  • unitable is another transformer-based approach with state-of-the-art performance.

Installation

1. Core Library

pip install openparse

Enabling OCR Support:

PyMuPDF already contains all the logic needed to support OCR, but it also needs Tesseract’s language support data, so installing Tesseract-OCR is still required.

The location of the language support folder must be provided either through the environment variable "TESSDATA_PREFIX" or as a parameter to the applicable functions.

For working OCR, make sure to complete this checklist:

  1. Install Tesseract.

  2. Locate Tesseract’s language support folder. Typically you will find it here:

    • Windows: C:/Program Files/Tesseract-OCR/tessdata

    • Unix systems: /usr/share/tesseract-ocr/5/tessdata

  3. Set the environment variable TESSDATA_PREFIX

    • Windows: setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"

    • Unix systems: export TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata

Note: On Windows systems, this must happen outside Python – before starting your script. Just manipulating os.environ will not work!
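
Before running OCR, it can help to confirm that the variable is visible to your Python process. The following is a minimal sketch that only reads the environment and never sets it; the helper name `find_tessdata_dir` is hypothetical and not part of PyMuPDF or Open Parse.

```python
import os
from typing import Mapping, Optional


def find_tessdata_dir(env: Mapping[str, str]) -> Optional[str]:
    """Return the TESSDATA_PREFIX directory if it is set and exists, else None."""
    path = env.get("TESSDATA_PREFIX", "").strip()
    if path and os.path.isdir(path):
        return path
    return None


tessdata_dir = find_tessdata_dir(os.environ)
if tessdata_dir is None:
    print("TESSDATA_PREFIX is unset or does not point to a directory; OCR may fail.")
else:
    print(f"Using Tesseract language data from {tessdata_dir}")
```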

2. ML Table Detection (Optional)

This repository provides an optional feature to parse content from tables using the state-of-the-art Table Transformer (DETR) model. The Table Transformer model, introduced in the paper "PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents" by Smock et al., achieves best-in-class results for table extraction.

pip install "openparse[ml]"

Then download the model weights with:

openparse-download

Cookbooks

https://github.com/Filimoa/open-parse/tree/main/src/cookbooks

Documentation

https://filimoa.github.io/open-parse/

