pdftolatex is a simple tool that essentially "decompiles" a PDF file into the LaTeX code that would have been used to create the PDF in the first place. As a college student who uses LaTeX for notes and homework typesetting, I created this tool after getting frustrated by all the time and effort I was spending copying down homework templates or notes before working on them myself. pdftolatex helps reduce some of the grind.
The LaTeX code generated by pdftolatex:
- Includes the non-textual elements of the PDF as figure environments (cropped images of these elements are stored in a local directory)
- Includes a default preamble (which can be customized if desired)
- Separates all paragraphs from the PDF using the \vspace command
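To make the output structure described above concrete, here is a minimal sketch of how such a document could be assembled. It is not pdftolatex's actual code: the names `DEFAULT_PREAMBLE`, `figure_block`, and `build_document`, the preamble contents, the figure width, and the `1em` gap are all assumptions chosen for illustration.

```python
# Hypothetical sketch: none of these names or values are taken from pdftolatex.
DEFAULT_PREAMBLE = "\n".join([
    r"\documentclass{article}",
    r"\usepackage{graphicx}",  # needed for \includegraphics
])


def figure_block(image_path, width=r"0.8\textwidth"):
    """Wrap a cropped image of a non-textual element in a figure environment."""
    return "\n".join([
        r"\begin{figure}[h]",
        r"\centering",
        rf"\includegraphics[width={width}]{{{image_path}}}",
        r"\end{figure}",
    ])


def build_document(blocks, preamble=DEFAULT_PREAMBLE, gap="1em"):
    """Assemble a LaTeX document from blocks given in page order.

    blocks is a list of ("text", paragraph) or ("figure", image_path) tuples.
    Consecutive blocks are separated by a \\vspace command.
    """
    body = []
    for kind, value in blocks:
        body.append(figure_block(value) if kind == "figure" else value)
    separator = "\n" + rf"\vspace{{{gap}}}" + "\n"
    return "\n".join([
        preamble,
        r"\begin{document}",
        separator.join(body),
        r"\end{document}",
    ])
```

Calling `build_document([("text", "Problem 1"), ("figure", "imgs/fig0.png")])` would return a complete document string in which the paragraph and the figure are separated by one \vspace command. Passing a different `preamble` string corresponds to the customization mentioned above.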
To use pdftolatex, run convert_pdf.py with either the --filepath argument to convert a single PDF or the --folderpath argument to convert every PDF file in the folder.
python convert_pdf.py --filepath example.pdf
python convert_pdf.py --folderpath example/
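For readers curious how the two arguments might be handled, the following is a rough sketch using Python's standard argparse module. Only the flag names --filepath and --folderpath come from this README. The function names `parse_args` and `collect_pdfs`, the choice to make the two flags mutually exclusive, and the non-recursive folder scan are assumptions, and the real convert_pdf.py may behave differently.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line (hypothetical sketch, not the real script)."""
    parser = argparse.ArgumentParser(description="Convert PDF files to LaTeX.")
    # Assumption: exactly one of the two flags must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--filepath", help="path to a single PDF file")
    group.add_argument("--folderpath", help="folder whose PDF files are all converted")
    return parser.parse_args(argv)


def collect_pdfs(args):
    """Return the list of PDF paths selected by the parsed arguments."""
    if args.filepath:
        return [Path(args.filepath)]
    folder = Path(args.folderpath)
    # Assumption: only the top level of the folder is scanned.
    return sorted(p for p in folder.iterdir() if p.suffix.lower() == ".pdf")
```

With this sketch, `collect_pdfs(parse_args(["--folderpath", "example/"]))` would return the sorted PDF paths found directly inside example/, each of which could then be passed to the conversion step.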
- OpenCV4 (cv2)
- pytesseract
- pillow
- tqdm
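A quick way to check whether the packages listed above are installed is sketched below. It is not part of pdftolatex. The helper name `missing_dependencies` is hypothetical, and the import names (`cv2` for OpenCV, `PIL` for pillow) are the usual ones for these packages. Note that pytesseract is a wrapper around the Tesseract OCR engine, which has to be installed separately, so the sketch also looks for a `tesseract` executable on the PATH.

```python
import importlib.util
import shutil

# Display name -> module name used in an import statement.
REQUIRED_MODULES = {
    "OpenCV": "cv2",
    "pytesseract": "pytesseract",
    "pillow": "PIL",
    "tqdm": "tqdm",
}


def missing_dependencies(modules=REQUIRED_MODULES):
    """Return display names of packages that cannot be imported."""
    return [
        name
        for name, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


def tesseract_available():
    """Return True if a tesseract executable is found on the PATH."""
    return shutil.which("tesseract") is not None
```

Running `missing_dependencies()` before a conversion returns an empty list when all four packages can be imported.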
- Implement an ML model to classify non-text regions
- Implement ML models to generate LaTeX code for tables and equations instead of merely including their pictures
- Create a web interface