This project forked from opendatalab/mineru

MinerU is a one-stop, open-source, high-quality data extraction tool that supports PDF, web page, and multi-format e-book extraction.

Home Page: https://opendatalab.com/OpenSourceTools

License: GNU Affero General Public License v3.0

MinerU

Introduction

MinerU is a one-stop, open-source, high-quality data extraction tool. It includes the following primary features:

Magic-PDF

Introduction

Magic-PDF is a tool designed to convert PDF documents into Markdown format. It can process files stored locally or in object storage that supports the S3 protocol.

Key features include:

  • Support for multiple front-end model inputs
  • Removal of headers, footers, footnotes, and page numbers
  • Human-readable layout formatting
  • Retains the original document's structure and formatting, including headings, paragraphs, lists, and more
  • Extraction and display of images and tables within markdown
  • Conversion of equations into LaTeX format
  • Automatic detection and conversion of garbled PDFs
  • Compatibility with CPU and GPU environments
  • Available for Windows, Linux, and macOS platforms

Project Panorama

[Figure: Project Panorama]

Flowchart

[Figure: Flowchart]

Submodule Repositories

  • PDF-Extract-Kit
    • A Comprehensive Toolkit for High-Quality PDF Content Extraction

Getting Started

Requirements

  • Python >= 3.9

Using a virtual environment is recommended, created with either venv or conda. Development is based on Python 3.10. If you encounter problems with other Python versions, please switch to Python 3.10.
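
The version requirement above can be checked before installing. The following is a minimal sketch, not part of Magic-PDF: the function name `check_python` and its return labels are hypothetical and exist only for illustration.

```python
import sys

MIN_VERSION = (3, 9)   # minimum stated in the requirements
DEV_VERSION = (3, 10)  # version the project is developed on


def check_python(version=None):
    """Classify a (major, minor) version against the stated requirements.

    Hypothetical helper: returns "unsupported", "supported", or "recommended".
    """
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor < MIN_VERSION:
        return "unsupported"
    if major_minor == DEV_VERSION:
        return "recommended"
    return "supported"


if __name__ == "__main__":
    # Report how the running interpreter compares with the requirements.
    print(check_python())
```

Running the script inside the virtual environment shows whether that environment's interpreter matches the version the project is developed on.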

Usage Instructions

1. Install Magic-PDF

# Basic features only (without built-in model parsing functionality)
pip install magic-pdf
# or
# Complete parsing capabilities (including high-precision model parsing)
# The quotes keep shells such as zsh from interpreting the square brackets
pip install "magic-pdf[full-cpu]"

# For high-precision model parsing, you will need to install the dependency detectron2.
# For detectron2, compile it yourself as per https://github.com/facebookresearch/detectron2/issues/5114
# Or use our precompiled wheel

# windows
pip install https://github.com/opendatalab/MinerU/raw/master/assets/whl/detectron2-0.6-cp310-cp310-win_amd64.whl

# linux
pip install https://github.com/opendatalab/MinerU/raw/master/assets/whl/detectron2-0.6-cp310-cp310-linux_x86_64.whl

# macOS(Intel)
pip install https://github.com/opendatalab/MinerU/raw/master/assets/whl/detectron2-0.6-cp310-cp310-macosx_10_9_universal2.whl

# macOS(M1/M2/M3)
pip install https://github.com/opendatalab/MinerU/raw/master/assets/whl/detectron2-0.6-cp310-cp310-macosx_11_0_arm64.whl
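
The choice among the four precompiled wheels above depends on the operating system and CPU architecture. As a rough sketch, the selection could be automated as follows. The wheel file names and base URL are the ones listed above; the function name `detectron2_wheel_url` is hypothetical. Note that all four wheels are tagged cp310, so they apply to Python 3.10 only.

```python
import platform

BASE_URL = "https://github.com/opendatalab/MinerU/raw/master/assets/whl/"

# (platform.system(), normalized machine) -> wheel file name from the list above
WHEELS = {
    ("Windows", "x86_64"): "detectron2-0.6-cp310-cp310-win_amd64.whl",
    ("Linux", "x86_64"): "detectron2-0.6-cp310-cp310-linux_x86_64.whl",
    ("Darwin", "x86_64"): "detectron2-0.6-cp310-cp310-macosx_10_9_universal2.whl",
    ("Darwin", "arm64"): "detectron2-0.6-cp310-cp310-macosx_11_0_arm64.whl",
}


def detectron2_wheel_url(system=None, machine=None):
    """Return the wheel URL for this platform, or None if no wheel is listed."""
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    if machine == "amd64":  # Windows reports AMD64 for 64-bit x86
        machine = "x86_64"
    name = WHEELS.get((system, machine))
    return BASE_URL + name if name else None


if __name__ == "__main__":
    print(detectron2_wheel_url())
```

When the function returns None, no precompiled wheel is listed for the platform, and compiling detectron2 yourself is the remaining option.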

2. Download the Model Weights Files

For details, see how_to_download_models.

After downloading the model weights, move the 'models' directory to a disk with plenty of free space, preferably an SSD.

3. Copy the Configuration File and Make Configurations

# Copy the configuration file to your home directory
cp magic-pdf.template.json ~/magic-pdf.json

In magic-pdf.json, configure "models-dir" to point to the directory where the model weights files are located.

{
  "models-dir": "/tmp/models"
}
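
The same setting can be written programmatically. The sketch below is only an illustration: the helper `write_config` is hypothetical, and the only key it touches is "models-dir", as shown above. It preserves any other keys already present in the file.

```python
import json
from pathlib import Path


def write_config(models_dir, config_path=None):
    """Set "models-dir" in magic-pdf.json, keeping any existing keys.

    Hypothetical helper; defaults to ~/magic-pdf.json as in the cp command above.
    """
    config_path = Path(config_path) if config_path else Path.home() / "magic-pdf.json"
    config = {}
    if config_path.exists():
        # Start from the existing (template) configuration if there is one.
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config["models-dir"] = str(models_dir)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


if __name__ == "__main__":
    print(write_config("/tmp/models"))
```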

4. Usage via Command Line

simple
# If the full version is installed, you can invoke the built-in models for parsing.
magic-pdf pdf-command --pdf "pdf_path" --inside_model true

After the program finishes, the generated Markdown files are in the directory "/tmp/magic-pdf", and the corresponding xxx_model.json file is in the same Markdown directory. If you intend to do secondary development on the post-processing pipeline, you can reuse that file with the command:

magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"

This way, the saved model output is reused and the models do not have to run again, which makes debugging more convenient.
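
When scripting these two invocations, the argument list can be assembled in Python. This is a minimal sketch: the subcommand and the flags (`--pdf`, `--inside_model`, `--model`) are the ones shown above, while the function name `build_command` is hypothetical.

```python
import shlex


def build_command(pdf_path, model_json_path=None):
    """Build the magic-pdf argument list for one PDF.

    Without model_json_path, the built-in models are used (full install needed);
    with it, the saved model output is reused.
    """
    cmd = ["magic-pdf", "pdf-command", "--pdf", str(pdf_path)]
    if model_json_path is None:
        cmd += ["--inside_model", "true"]
    else:
        cmd += ["--model", str(model_json_path)]
    return cmd


if __name__ == "__main__":
    # Print the command as a shell-quoted string for inspection.
    print(shlex.join(build_command("paper.pdf")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to execute the tool, which avoids shell quoting problems with paths that contain spaces.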

more
magic-pdf --help

5. Acceleration Using CUDA or MPS

CUDA

Install the PyTorch build that matches your CUDA version.

# To use the GPU, reinstall PyTorch for your CUDA version. This example installs the CUDA 11.8 build.
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118

Also, you need to modify the value of "device-mode" in the configuration file magic-pdf.json.

{
  "device-mode":"cuda"
}
MPS

For macOS users with M-series chip devices, you can use MPS for inference acceleration. You also need to modify the value of "device-mode" in the configuration file magic-pdf.json.

{
  "device-mode":"mps"
}
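
To decide which value to put in "device-mode", you can ask PyTorch what is available on the machine. The following sketch assumes that "cpu" is the value used when no accelerator is present; only "cuda" and "mps" appear above, so treat "cpu" as an assumption. The function name `detect_device_mode` is hypothetical.

```python
def detect_device_mode():
    """Suggest a "device-mode" value: "cuda", "mps", or "cpu" (assumed fallback)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch releases.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(detect_device_mode())
```

If this prints "cpu" on a machine with an NVIDIA GPU, the installed PyTorch build most likely does not match the CUDA version, and the reinstall step above applies.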

6. Usage via API

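Local
import os
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# pdf_bytes, model_json, and local_image_dir must be defined by the caller
image_writer = DiskReaderWriter(local_image_dir)  # writes extracted images to disk
image_dir = str(os.path.basename(local_image_dir))  # image path used inside the Markdown
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")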
Object Storage
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Credentials, endpoints, s3_pdf_path, and model_json must be defined by the caller
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)  # read the PDF as bytes
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")

For a complete example, see demo.py.
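
The local snippet above leaves file handling to the caller. The following sketch wraps it into one function that reads a PDF and a saved xxx_model.json from disk and writes the Markdown next to an images directory. Several things here are assumptions rather than documented behavior: the import paths `magic_pdf.pipe.UNIPipe` and `magic_pdf.rw.DiskReaderWriter` (verify them against your installed version), the hypothetical helper names `plan_outputs` and `pdf_to_markdown`, the "images" directory name, and the expectation that `pipe_mk_markdown` returns a string.

```python
import json
import os


def plan_outputs(pdf_path, output_dir):
    """Derive the Markdown file path and image directory for a PDF (hypothetical layout)."""
    name = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(output_dir, name + ".md"), os.path.join(output_dir, "images")


def pdf_to_markdown(pdf_path, model_json_path, output_dir):
    """Re-run post-processing on a PDF using a previously saved model JSON."""
    # Assumed import paths; imported lazily so the module loads without magic-pdf.
    from magic_pdf.pipe.UNIPipe import UNIPipe
    from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

    md_path, local_image_dir = plan_outputs(pdf_path, output_dir)
    os.makedirs(local_image_dir, exist_ok=True)

    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()
    with open(model_json_path, encoding="utf-8") as f:
        model_json = json.load(f)

    # From here on, the calls mirror the local snippet above.
    image_writer = DiskReaderWriter(local_image_dir)
    image_dir = str(os.path.basename(local_image_dir))
    jso_useful_key = {"_pdf_type": "", "model_list": model_json}
    pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
    pipe.pipe_classify()
    pipe.pipe_parse()
    md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")

    # Assumes md_content is a str.
    with open(md_path, "w", encoding="utf-8") as f:
        f.write(md_content)
    return md_path
```

A call such as `pdf_to_markdown("paper.pdf", "paper_model.json", "out")` would then write `out/paper.md` with image links relative to `out/images`.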

Magic-Doc

Introduction

Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.

Key Features Include:

  • Web Page Extraction

    • Cross-modal precise parsing of text, images, tables, and formula information.
  • E-Book Document Extraction

    • Supports various document formats, including EPUB and MOBI, with full adaptation for text and images.
  • Language Type Identification

    • Accurate recognition of 176 languages.

Project Repository

  • Magic-Doc: Outstanding Webpage and E-book Extraction Tool

License Information

LICENSE.md

The project currently relies on PyMuPDF for advanced functionality. Because PyMuPDF is licensed under the AGPL, this may limit certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to improve user-friendliness and flexibility.

Citation

@misc{2024mineru,
    title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
    author={MinerU Contributors},
    howpublished = {\url{https://github.com/opendatalab/MinerU}},
    year={2024}
}
