This quickly hacked Python script organizes all PDFs in a given directory by parsing their content. It moves each file into a directory named after the first author (formatted as "Lastname Surname") and renames the file in the format "year title.pdf".
- Content Parsing: Extracts metadata like author names, publication year, and title from PDF files.
- File Organization: Moves PDFs into author-named directories and renames them based on year and title.
- OCR Capability: Performs on-the-fly OCR as needed using pytesseract.
- Python: Ensure Python is installed on your system.
- Python Libraries: Requires specific libraries which can be installed via pip:
pip install PyPDF2 pdf2image pytesseract tqdm openai
- System Dependencies:
- Poppler and Tesseract: Necessary for PDF to image conversion and OCR functionality.
sudo apt-get install poppler-utils tesseract-ocr tesseract-ocr-eng
- Poppler and Tesseract: Necessary for PDF to image conversion and OCR functionality.
- Ollama Setup:
- Model Installation: The script uses a local Ollama instance with a specific LLM model. The default is
cas/spaetzle-v58
, suitable for English and German text. Change the model as per your requirements.ollama pull cas/spaetzle-v58
- Model Installation: The script uses a local Ollama instance with a specific LLM model. The default is
Run the script in the directory containing the PDFs you want to sort:
python parse_pdfs.py ./
This command sorts all PDFs in the current directory. Customize the script's model or configurations as necessary to fit your specific needs.