pdftolatex's Introduction

pdftolatex

Description

pdftolatex is a simple tool that essentially "decompiles" a PDF file into the LaTeX code that would have been used to create the PDF in the first place. As a college student who uses LaTeX for notes and homework typesetting, I created this tool after getting frustrated by the time and effort I was spending copying down homework templates or notes before working on them myself. pdftolatex helps reduce some of the grind.

The LaTeX code generated by pdftolatex:

Includes the non-textual elements of the PDF as figure environments (cropped images of these elements are stored in a local directory)
Includes a default preamble (which can be customized if desired)
Separates all paragraphs from the PDF using the \vspace command
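
To make the shape of the output concrete, here is a minimal sketch of how such a document could be assembled. It is not pdftolatex's actual code: DEFAULT_PREAMBLE, figure_block, build_document, the block tuples, and the 1em gap are all hypothetical names and values chosen for illustration, and the sketch does not escape LaTeX special characters in the text.

```python
# Hypothetical sketch: none of these names come from pdftolatex itself.
DEFAULT_PREAMBLE = "\n".join([
    r"\documentclass{article}",
    r"\usepackage{graphicx}",  # needed for \includegraphics
])


def figure_block(image_path):
    """Wrap a cropped image of a non-textual region in a figure environment."""
    return "\n".join([
        r"\begin{figure}[h]",
        r"\centering",
        r"\includegraphics[width=\linewidth]{" + image_path + "}",
        r"\end{figure}",
    ])


def build_document(blocks, preamble=DEFAULT_PREAMBLE, gap="1em"):
    """Assemble LaTeX source from ("text", str) and ("figure", path) blocks."""
    body = []
    for kind, value in blocks:
        if kind == "text":
            body.append(value)
        elif kind == "figure":
            body.append(figure_block(value))
        else:
            raise ValueError(f"unknown block kind: {kind!r}")
    # Consecutive blocks are separated by a \vspace command.
    separator = "\n\n\\vspace{" + gap + "}\n\n"
    return "\n".join([
        preamble,
        r"\begin{document}",
        separator.join(body),
        r"\end{document}",
    ])


print(build_document([
    ("text", "Problem 1."),
    ("figure", "images/page1_fig0.png"),
    ("text", "Problem 2."),
]))
```

Passing a different preamble string to build_document is one way the "customizable preamble" idea could be exposed; the real tool may do this differently.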

Usage

To use pdftolatex, run convert_pdf.py with either the --filepath argument to convert a single PDF or the --folderpath argument to convert every PDF file in a folder.

python convert_pdf.py --filepath example.pdf
python convert_pdf.py --folderpath example/
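
For readers curious how the two arguments might be handled, the following is a rough sketch using the standard argparse module. It is an assumption about the structure, not a copy of convert_pdf.py: parse_args and collect_pdfs are hypothetical helpers, and treating the two flags as mutually exclusive is a guess based on the "either ... or" wording above.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the two documented flags (hypothetical reimplementation)."""
    parser = argparse.ArgumentParser(description="Convert PDF files to LaTeX.")
    # Assumption: exactly one of the two flags must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--filepath", help="path to a single PDF to convert")
    group.add_argument("--folderpath", help="folder whose PDF files are all converted")
    return parser.parse_args(argv)


def collect_pdfs(args):
    """Return the list of PDF paths selected by the parsed arguments."""
    if args.filepath:
        return [Path(args.filepath)]
    # Only files directly inside the folder are considered in this sketch.
    return sorted(Path(args.folderpath).glob("*.pdf"))


print(collect_pdfs(parse_args(["--filepath", "example.pdf"])))
```

Each returned path would then be handed to the conversion step, which is outside the scope of this sketch.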

Notes

Packages Required

OpenCV4 (cv2)
pytesseract
pillow
tqdm
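
Before running the converter, it can help to confirm that these packages are importable. The snippet below is a small sketch for that purpose; missing_packages and REQUIRED_MODULES are hypothetical names, not part of pdftolatex. Note that pillow is imported as PIL, and that pytesseract is a wrapper that also needs the Tesseract OCR engine installed on the system, which this check does not verify.

```python
import importlib.util

# Import name -> package name as listed above (pillow is imported as PIL).
REQUIRED_MODULES = {
    "cv2": "OpenCV",
    "pytesseract": "pytesseract",
    "PIL": "pillow",
    "tqdm": "tqdm",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the package names whose import module cannot be found."""
    return [
        package
        for module, package in modules.items()
        if importlib.util.find_spec(module) is None
    ]


print("Missing packages:", missing_packages())
```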

Future Improvements

Implement an ML model to classify non-text regions
Implement ML models to generate LaTeX code for tables and equations instead of merely including their pictures
Create a web interface
