MinerU is a one-stop, open-source, high-quality data extraction tool that includes the following primary features:
Magic-PDF is a tool designed to convert PDF documents into Markdown format. It can process files stored locally or on object storage that supports the S3 protocol.
Key features include:
- Support for multiple front-end model inputs
- Removal of headers, footers, footnotes, and page numbers
- Human-readable layout formatting
- Preservation of the original document's structure and formatting, including headings, paragraphs, lists, and more
- Extraction and display of images and tables within markdown
- Conversion of equations into LaTeX format
- Automatic detection and conversion of garbled PDFs
- Compatibility with CPU and GPU environments
- Available for Windows, Linux, and macOS platforms
- PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction
- Python >= 3.9
It is recommended to use a virtual environment, created with either venv or conda. Development is based on Python 3.10; if you encounter problems with other Python versions, please switch to Python 3.10.
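The version guidance above can be checked before installing. The following is a minimal sketch; the function name `check_python` and the three labels it returns are illustrative and not part of MinerU.

```python
import sys

MIN_SUPPORTED = (3, 9)   # minimum version stated in the requirements
DEV_VERSION = (3, 10)    # version the project is developed on


def check_python(version=None):
    """Classify a (major, minor) version as 'unsupported', 'recommended', or 'ok'."""
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor < MIN_SUPPORTED:
        return "unsupported"
    if major_minor == DEV_VERSION:
        return "recommended"
    return "ok"


if __name__ == "__main__":
    # Reports the status of the interpreter running this script.
    print(check_python())
```

If the result is "ok" and installation problems appear, switching to Python 3.10 is the suggested fallback.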
# If you only need the basic features (without built-in model parsing functionality)
pip install magic-pdf
# or
# For complete parsing capabilities (including high-precision model parsing)
pip install "magic-pdf[full-cpu]"
# For high-precision model parsing, you will need to install the dependency detectron2.
# For detectron2, compile it yourself as per https://github.com/facebookresearch/detectron2/issues/5114
# Or use our precompiled wheel
# windows
pip install https://github.com/opendatalab/MinerU/raw/master/assets/whl/detectron2-0.6-cp310-cp310-win_amd64.whl
# linux
pip install https://github.com/opendatalab/MinerU/raw/master/assets/whl/detectron2-0.6-cp310-cp310-linux_x86_64.whl
# macOS(Intel)
pip install https://github.com/opendatalab/MinerU/raw/master/assets/whl/detectron2-0.6-cp310-cp310-macosx_10_9_universal2.whl
# macOS(M1/M2/M3)
pip install https://github.com/opendatalab/MinerU/raw/master/assets/whl/detectron2-0.6-cp310-cp310-macosx_11_0_arm64.whl
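Choosing among the four precompiled wheels can be scripted. The sketch below maps the values reported by Python's `platform` module to the wheel URLs listed above; the helper name `select_wheel` is hypothetical, and the mapping assumes the wheels only suit Python 3.10 (they are tagged cp310).

```python
import platform

BASE = "https://github.com/opendatalab/MinerU/raw/master/assets/whl/"

# (platform.system(), normalized platform.machine()) -> wheel file name
WHEELS = {
    ("Windows", "x86_64"): "detectron2-0.6-cp310-cp310-win_amd64.whl",
    ("Linux", "x86_64"): "detectron2-0.6-cp310-cp310-linux_x86_64.whl",
    ("Darwin", "x86_64"): "detectron2-0.6-cp310-cp310-macosx_10_9_universal2.whl",
    ("Darwin", "arm64"): "detectron2-0.6-cp310-cp310-macosx_11_0_arm64.whl",
}


def select_wheel(system=None, machine=None):
    """Return the wheel URL for this platform, or None if no wheel is listed."""
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    if machine == "amd64":  # Windows reports AMD64 for 64-bit x86
        machine = "x86_64"
    name = WHEELS.get((system, machine))
    return BASE + name if name else None


if __name__ == "__main__":
    url = select_wheel()
    print(url or "No precompiled wheel listed; compile detectron2 yourself.")
```

When the function returns None (for example on Linux ARM), compiling detectron2 from source remains the only option described here.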
For detailed instructions on downloading the model weights, see how_to_download_models.
After downloading the model weights, move the 'models' directory to a disk with ample free space, preferably an SSD.
# Copy the configuration file to your home directory
cp magic-pdf.template.json ~/magic-pdf.json
In magic-pdf.json, set "models-dir" to the directory that contains the model weight files.
{
"models-dir": "/tmp/models"
}
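Editing the configuration file by hand works, but the same change can be made programmatically. This is a minimal sketch using only the standard library; `set_config_value` is a hypothetical helper, and it assumes magic-pdf.json is plain JSON with top-level keys as shown above.

```python
import json
from pathlib import Path


def set_config_value(config_path, key, value):
    """Update one top-level key in a JSON config file, keeping the other keys."""
    path = Path(config_path).expanduser()
    # Start from the existing settings if the file is already there.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config[key] = value
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config


# Example call (not executed here):
# set_config_value("~/magic-pdf.json", "models-dir", "/tmp/models")
```

Reading the file first means that other settings copied from the template are left untouched.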
# If the full version is installed, you can invoke the built-in models for parsing.
magic-pdf pdf-command --pdf "pdf_path" --inside_model true
When the program finishes, the generated markdown files are in the directory "/tmp/magic-pdf", and the corresponding xxx_model.json file is in the same markdown directory. If you intend to do further development on the post-processing pipeline, you can reuse that file with the command:
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
This way, you won't need to re-run the model, which makes debugging more convenient.
magic-pdf --help
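The two command forms above differ only in their last option, so they can be wrapped for batch use. The sketch below builds the argument list with the flags shown in this section and runs it via `subprocess`; the function names `build_command` and `run_magic_pdf` are illustrative, and it assumes the `magic-pdf` executable is on the PATH.

```python
import subprocess


def build_command(pdf_path, model_json_path=None):
    """Build the magic-pdf invocation for one PDF.

    Without a model JSON, the built-in models are used (--inside_model true);
    with one, the saved model output is reused (--model).
    """
    cmd = ["magic-pdf", "pdf-command", "--pdf", str(pdf_path)]
    if model_json_path is None:
        cmd += ["--inside_model", "true"]
    else:
        cmd += ["--model", str(model_json_path)]
    return cmd


def run_magic_pdf(pdf_path, model_json_path=None):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(pdf_path, model_json_path), check=True)
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.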
You need to install the corresponding PyTorch version according to your CUDA version.
# When using the GPU solution, you need to reinstall PyTorch for the corresponding CUDA version. This example installs the CUDA 11.8 version.
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
Also, you need to modify the value of "device-mode" in the configuration file magic-pdf.json.
{
"device-mode":"cuda"
}
macOS users with M-series chips can use MPS for inference acceleration. This also requires changing the value of "device-mode" in the configuration file magic-pdf.json.
{
"device-mode":"mps"
}
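Which "device-mode" value fits a given machine can be probed from PyTorch itself. The following is a sketch under assumptions: `detect_device_mode` is a hypothetical helper, and the fallback value "cpu" is assumed to be an accepted setting (only "cuda" and "mps" are shown above).

```python
import json


def detect_device_mode():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumed default when PyTorch is not installed
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing in older PyTorch releases.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    # Prints a fragment in the same shape as the config snippets above.
    print(json.dumps({"device-mode": detect_device_mode()}))
```

If this reports "cpu" on a machine with an NVIDIA GPU, the installed PyTorch build probably does not match the CUDA version.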
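# Local file example
# Assumes local_image_dir, pdf_bytes, and model_json are already defined.
import os
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

image_writer = DiskReaderWriter(local_image_dir)  # writes extracted images to disk
image_dir = str(os.path.basename(local_image_dir))  # image folder name referenced in the markdown
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()  # classify the PDF type
pipe.pipe_parse()  # parse the document
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")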
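# Object storage example
# Assumes the access keys, endpoints, s3_pdf_path, and model_json are already defined.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)  # client for reading the PDF
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)  # client for writing images
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)  # read the PDF as bytes
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()  # classify the PDF type
pipe.pipe_parse()  # parse the document
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")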
For a complete demo, see demo.py.
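The local file example uses `pdf_bytes` and `model_json` without showing where they come from, and it stops once `md_content` is produced. The sketch below fills in those two ends with the standard library only. The helper names are hypothetical, the empty list used when no model file is given is a placeholder whose handling should be checked against demo.py, and the list handling in `save_markdown` is a precaution rather than documented behavior.

```python
import json
from pathlib import Path


def load_inputs(pdf_path, model_json_path=None):
    """Return (pdf_bytes, model_json) for use with the pipeline."""
    pdf_bytes = Path(pdf_path).read_bytes()
    model_json = []  # placeholder when no saved model output is supplied
    if model_json_path is not None:
        model_json = json.loads(Path(model_json_path).read_text(encoding="utf-8"))
    return pdf_bytes, model_json


def save_markdown(md_content, out_path):
    """Write the markdown to disk and return the output path."""
    if isinstance(md_content, list):
        # Join parts in case the pipeline returns a list of strings.
        md_content = "\n".join(md_content)
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(md_content, encoding="utf-8")
    return out
```

With these helpers, the pipeline calls shown earlier sit between `load_inputs(...)` and `save_markdown(md_content, ...)`.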
Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.
Key features include:
- Web Page Extraction
  - Cross-modal precise parsing of text, images, tables, and formula information.
- E-Book Document Extraction
  - Supports various document formats, including epub and mobi, with full adaptation for text and images.
- Language Type Identification
  - Accurate recognition of 176 languages.
- Magic-Doc: Outstanding Webpage and E-book Extraction Tool
The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility.
@misc{2024mineru,
title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
author={MinerU Contributors},
howpublished = {\url{https://github.com/opendatalab/MinerU}},
year={2024}
}