This Python project extracts text from PDF files using the PyPDF2 library and preprocesses the extracted text to format recipe information.
- Extracts text from PDF files
- Preprocesses extracted text to format recipe information
- Saves both raw extracted text and processed recipe text
- Easy to use and modify for specific needs
- Python 3.6+
- PyPDF2 library
- NLTK library
- Clone this repository:
    git clone https://github.com/houmairi/pdf2text.git
    cd pdf2text
- Create a virtual environment and activate it (optional):
    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Install the required packages:
    pip install -r requirements.txt
- Place your PDF file in the `books` directory (or modify the `pdf_path` in the script).
- To extract text from the PDF without preprocessing:
    python main.py
  The extracted raw text will be saved in `books/extracted_text.txt`.
- To extract text from the PDF and preprocess it:
    python main.py -p
  The extracted raw text will be saved in `books/extracted_text.txt`, and the processed recipe text will be saved in `books/processed_recipes.txt`.
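The two invocations above can be pictured with a minimal sketch like the one below. It is not the project's actual `main.py`: the function names, the input file name `books/examplebook.pdf`, and the placeholder preprocessing step are assumptions made for illustration. It also assumes PyPDF2 2.0 or later, which provides `PdfReader` and `page.extract_text()`.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; -p switches on recipe preprocessing."""
    parser = argparse.ArgumentParser(description="Extract text from a PDF.")
    parser.add_argument("-p", dest="preprocess", action="store_true",
                        help="also write processed recipe text")
    return parser.parse_args(argv)


def extract_text(pdf_path):
    """Return the text of every page, joined by newlines."""
    # Imported lazily so argument parsing works even without PyPDF2 installed.
    # Assumption: PyPDF2 >= 2.0 (PdfReader, reader.pages, page.extract_text()).
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() can yield nothing for image-only pages, hence the fallback.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def main(argv=None):
    args = parse_args(argv)
    pdf_path = Path("books/examplebook.pdf")  # hypothetical input file name
    raw_text = extract_text(pdf_path)
    Path("books/extracted_text.txt").write_text(raw_text, encoding="utf-8")
    if args.preprocess:
        # Placeholder only: the real logic lives in preprocess_recipes.py.
        processed = raw_text.strip()
        Path("books/processed_recipes.txt").write_text(processed, encoding="utf-8")
```

In this sketch, calling `main([])` corresponds to `python main.py` and `main(["-p"])` to `python main.py -p`.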
In the `/books` directory, you can find examples of the script's output:
- `examplebook_extracted_text.txt`: Contains the raw extracted text from a sample cooking book PDF.
- `examplebook_processed_recipes.txt`: Contains the processed and formatted recipe information.
You can modify the following in `main.py`:
- `pdf_path`: Change the input PDF file location
- `extracted_text_file`: Change the output location for raw extracted text
- `processed_recipes_file`: Change the output location for processed recipe text
You can also modify `preprocess_recipes.py` to adjust the recipe preprocessing logic according to your needs.
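As a starting point for such adjustments, here is a small, hypothetical cleanup function of the kind that could be added to the preprocessing step. It is not taken from `preprocess_recipes.py`; the name `clean_extracted_text` and the rules it applies are assumptions, and it uses only the standard `re` module rather than NLTK.

```python
import re


def clean_extracted_text(text):
    """Tidy raw PDF text: rejoin hyphenated words and normalize whitespace."""
    # Rejoin words split across a line break, e.g. "ingre-\ndients".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line and drop the empty ones.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


print(clean_extracted_text("Ingre-\ndients:\n\n  2   cups  flour \n"))
```

Note that the hyphen rule also joins words that are genuinely hyphenated at a line end, so it may need tightening for some books.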
- `main.py`: The main script that handles PDF text extraction and calls the preprocessing function
- `preprocess_recipes.py`: Contains the logic for preprocessing and formatting cooking recipe information
- `requirements.txt`: Lists all the Python dependencies for the project
This project is open source and available under the MIT License.
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.