PDF Text Extractor and Recipe Preprocessor

This Python project extracts text from PDF files using the PyPDF2 library and preprocesses the extracted text to format recipe information.

Features

  • Extracts text from PDF files
  • Preprocesses extracted text to format recipe information
  • Saves both raw extracted text and processed recipe text
  • Easy to use and modify for specific needs
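
As a rough illustration of the extraction feature, the sketch below reads every page of a PDF with PyPDF2 and writes the combined text to a file. The function names `join_page_text` and `extract_pdf_text` are hypothetical and not taken from main.py, and the sketch assumes a PyPDF2 release that provides `PdfReader` (older releases used `PdfFileReader`).

```python
from pathlib import Path


def join_page_text(pages):
    """Concatenate the text of page-like objects that expose extract_text()."""
    # extract_text() may return None or "" for image-only pages, so fall back to "".
    return "\n".join((page.extract_text() or "") for page in pages)


def extract_pdf_text(pdf_path, output_path):
    """Read a PDF with PyPDF2 and write its raw text to output_path (hypothetical helper)."""
    # Imported lazily so join_page_text stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    text = join_page_text(reader.pages)
    Path(output_path).write_text(text, encoding="utf-8")
    return text
```

Calling `extract_pdf_text("books/my_cookbook.pdf", "books/extracted_text.txt")` would mirror the raw-text output described under Usage; the input file name here is only a placeholder.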

Requirements

  • Python 3.6+
  • PyPDF2 library
  • NLTK library

Installation

  1. Clone this repository:

    git clone https://github.com/houmairi/pdf2text.git
    cd pdf2text
    
  2. (Optional) Create a virtual environment and activate it:

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
    
  3. Install the required packages:

    pip install -r requirements.txt
    

Usage

  1. Place your PDF file in the books directory (or change pdf_path in main.py).

  2. To extract text from the PDF without preprocessing:

    python main.py
    

    The extracted raw text will be saved in books/extracted_text.txt.

  3. To extract text from the PDF and preprocess it:

    python main.py -p
    

    The extracted raw text will be saved in books/extracted_text.txt, and the processed recipe text will be saved in books/processed_recipes.txt.
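
The two invocations above differ only in the `-p` flag. A minimal sketch of how such a flag could be parsed with `argparse` follows; the long alias `--preprocess` and the helper names `build_parser` and `outputs_for` are assumptions made for illustration, not details confirmed by main.py.

```python
import argparse


def build_parser():
    """Build a command-line parser with a single optional preprocessing switch."""
    parser = argparse.ArgumentParser(
        description="Extract text from a PDF, optionally preprocessing recipes."
    )
    # -p is the flag shown in the usage steps; --preprocess is an assumed alias.
    parser.add_argument(
        "-p",
        "--preprocess",
        action="store_true",
        help="also write the processed recipe text",
    )
    return parser


def outputs_for(args):
    """Return the output files a run with these arguments would produce."""
    outputs = ["books/extracted_text.txt"]  # always written
    if args.preprocess:
        outputs.append("books/processed_recipes.txt")  # only with -p
    return outputs
```

With this sketch, `outputs_for(build_parser().parse_args(["-p"]))` lists both files, while an empty argument list yields only the raw text file.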

Example

In the books directory, you can find examples of the script's output:

  • examplebook_extracted_text.txt: Contains the raw extracted text from a sample cooking book PDF.
  • examplebook_processed_recipes.txt: Contains the processed and formatted recipe information.

Customization

You can modify the following in main.py:

  • pdf_path: Change the input PDF file location
  • extracted_text_file: Change the output location for raw extracted text
  • processed_recipes_file: Change the output location for processed recipe text
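
One way to keep these three settings consistent is to derive both output paths from the input PDF, following the naming of the example files (`examplebook_extracted_text.txt`, `examplebook_processed_recipes.txt`). This is a sketch under assumptions: `output_paths_for` is a hypothetical helper, and main.py may simply assign the paths as literal strings.

```python
from pathlib import Path


def output_paths_for(pdf_path):
    """Derive output file paths next to the input PDF (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    stem = pdf_path.stem  # e.g. "examplebook" for "examplebook.pdf"
    extracted = pdf_path.with_name(f"{stem}_extracted_text.txt")
    processed = pdf_path.with_name(f"{stem}_processed_recipes.txt")
    return extracted, processed


# Placeholder input; replace with the location of your own PDF.
pdf_path = Path("books") / "examplebook.pdf"
extracted_text_file, processed_recipes_file = output_paths_for(pdf_path)
```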

You can also modify preprocess_recipes.py to adjust the recipe preprocessing logic according to your needs.
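
To give a sense of the kind of adjustment this might involve, here is a small cleanup pass over raw extracted text. It is purely illustrative: `preprocess_text` is a hypothetical function, it uses only the standard library rather than NLTK, and it does not reproduce the actual logic in preprocess_recipes.py.

```python
import re


def preprocess_text(raw_text):
    """Hypothetical cleanup pass: normalize whitespace and drop noise lines."""
    cleaned = []
    for line in raw_text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
        if not line:
            continue  # skip blank lines
        if line.isdigit():
            continue  # skip lines that hold only a page number
        cleaned.append(line)
    return "\n".join(cleaned)
```

Rules like the page-number filter are the sort of thing you would tune per cookbook, since PDF layouts vary widely.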

Project Structure

  • main.py: The main script that handles PDF text extraction and calls the preprocessing function
  • preprocess_recipes.py: Contains the logic for preprocessing and formatting cooking recipe information
  • requirements.txt: Lists all the Python dependencies for the project

License

This project is open source and available under the MIT License.

Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
