This project is used for the DART PDF object detection task.
Clone this repository:
git clone https://github.com/snu-dm/DART_pdf_extractor.git
cd DART_pdf_extractor/
This code requires Python 3 and pdfplumber. Install the dependencies with:
pip install -r requirements.txt
Use the following command-line arguments in your Linux terminal:
- text extraction (-text)
- table extraction (-table)
- image extraction (-image)
- caption extraction (-caption)
- pdf_dir (-dir)
- save_dir (-save)
- cropped_file_only (-crop)
- total_page_with_segmentation (-page)
Example Usage:
python main.py -text -table -image -caption -crop -page
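
For readers unfamiliar with single-dash flags like these, the following is a minimal sketch of how such arguments could be declared with Python's standard argparse module. It is not taken from main.py: the function name build_parser, the default paths, and the assumption that -dir and -save take a path value while the other flags are on/off switches are all hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags listed in this README.

    This is an illustration only; the actual main.py may differ.
    """
    parser = argparse.ArgumentParser(description="Sketch of the extractor CLI")

    # Assumed to be on/off switches: a step runs only if its flag is given.
    for flag in ("-text", "-table", "-image", "-caption", "-crop", "-page"):
        parser.add_argument(flag, action="store_true")

    # Assumed to take a path value; these defaults are placeholders.
    parser.add_argument("-dir", default="pdf", help="directory containing input PDFs")
    parser.add_argument("-save", default="output", help="directory for extracted results")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch
# behaves the same wherever it is run.
args = build_parser().parse_args(["-text", "-table", "-dir", "my_pdfs"])
print(args.text, args.table, args.image, args.dir)
```

Under these assumptions, omitting -dir and -save (as in the example usage above) would fall back to the default directories, and any extraction flag that is left out stays disabled.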