Epaper Scraper

This is a Python-based web scraper for extracting data from online newspapers. It has been tested on the Jugantor and Prothom Alo newspapers, covering issues from 2012 to 2024, on a Windows 11 machine.

Installation

Prerequisites

Before installing the scraper, ensure you have the following prerequisites:

  • A Windows machine (Tested on Windows 11)
  • Python 3.12.0 installed on your system
  • Anaconda for installing some packages
  • NVIDIA CUDA Toolkit 12.4 for GPU accelerated processes
  • Firefox installed as your browser. Make sure to install it in C:\Program Files\Mozilla Firefox\
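The checks that can be automated from the list above can be sketched as a small script. This is a minimal sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, and it assumes the Firefox executable is named `firefox.exe` inside the install directory given above. It does not check Anaconda or the CUDA Toolkit.

```python
import sys
from pathlib import Path

# Install path and Python version taken from the prerequisites list.
FIREFOX_DIR = Path(r"C:\Program Files\Mozilla Firefox")
TESTED_PYTHON = (3, 12)


def check_prerequisites(firefox_dir: Path = FIREFOX_DIR) -> dict[str, bool]:
    """Return a name -> passed mapping for the checks that can be automated."""
    return {
        "windows": sys.platform == "win32",
        "python_3_12": sys.version_info[:2] == TESTED_PYTHON,
        # Assumption: the browser executable is firefox.exe in that directory.
        "firefox_installed": (firefox_dir / "firefox.exe").is_file(),
    }


if __name__ == "__main__":
    for name, passed in check_prerequisites().items():
        print(f"{name}: {'OK' if passed else 'MISSING'}")
```

A failed check does not necessarily mean the scraper will not run; it only means the setup differs from the one the project was tested on.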

Installation Steps

CUDA Toolkit

Only NVIDIA GPUs are supported for now, and only the models listed on this page. If your graphics card has CUDA cores, you can proceed with the setup. If not, contact the developer.

  1. Make sure that your NVIDIA drivers are up to date.

  2. Add Anaconda to your PATH environment variable, then run the following commands in the command prompt.

conda install numba
conda install cudatoolkit

NOTE: If Anaconda is not on your PATH, navigate to the Anaconda installation directory, locate the Scripts directory, and open the command prompt there.
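After the two conda commands finish, you may want to confirm that Numba can actually see the GPU. The following is a minimal sketch, not part of the repository: the helper name `cuda_status` is hypothetical, and it relies only on `numba.cuda.is_available()`, which reports whether a usable CUDA driver and device were found.

```python
def cuda_status() -> str:
    """Report whether Numba is importable and can see a CUDA-capable GPU."""
    try:
        from numba import cuda
    except ImportError:
        return "numba is not installed"
    if cuda.is_available():
        return "CUDA GPU available"
    return "numba is installed, but no CUDA GPU or driver was found"


if __name__ == "__main__":
    print(cuda_status())
```

Run it from the same environment in which you installed the packages. If it reports that no GPU or driver was found, recheck the driver update in step 1 first.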

Tesseract

  1. Download the Tesseract OCR executable from here.

  2. Install Tesseract OCR by following the installation instructions provided in the repository. Make sure to install it in C:\Program Files (x86)\Tesseract-OCR.

  3. Open a command prompt or Anaconda prompt.

  4. Navigate to the directory where you have cloned or downloaded the epaper-scraper repository.

  5. Create and activate a virtual environment (optional but recommended):

    python -m venv venv
    venv\Scripts\activate
  6. Install the required Python packages using pip:

    pip install -r requirements.txt
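  7. Test if Tesseract OCR is installed correctly by opening a Python prompt and running:

    import pytesseract
    # Path from step 2; this line is unnecessary if Tesseract is already on your PATH.
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
    print(pytesseract.get_tesseract_version())

    If this prints a version number without errors, Tesseract OCR is installed successfully.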

Usage

There are two ways to use this software: With GUI and Without GUI.

To use the epaper-scraper With GUI, follow these steps:

  1. Run main.py from the src directory. This launches a desktop application like the following:

     [Screenshot: Epaper Scraper Interface]

  2. Navigate the interface to use the software's supported features.

Note: The GUI lacks some advanced features that are available in the "Without GUI" version. The interface is being updated continually to add them.

To use the advanced features of epaper-scraper Without GUI, follow these steps:

  1. Run the start_firefox.bat file. Alternatively, run the commands from cmd.txt. This will initialize a Firefox browser instance.

  2. Call the functions you need and adjust their parameters in the Python files in src, then run the script. Example:

    python main.py
  3. The scraper will start extracting data from the specified newspaper website and save it to the specified output directory.
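If you would rather start the browser from Python than from the batch file in step 1, something along these lines may work. This is only a sketch under stated assumptions: the contents of start_firefox.bat and cmd.txt are not reproduced here, so `extra_args` is a placeholder for whatever flags those files specify. The function names `firefox_command` and `start_firefox` are hypothetical, and the executable name `firefox.exe` is assumed.

```python
import subprocess
from pathlib import Path

# Install location required in the prerequisites; firefox.exe is assumed.
FIREFOX_EXE = Path(r"C:\Program Files\Mozilla Firefox") / "firefox.exe"


def firefox_command(extra_args=(), exe=FIREFOX_EXE):
    """Build the argument list for launching Firefox.

    extra_args stands in for the flags listed in cmd.txt; none are assumed.
    """
    return [str(exe), *extra_args]


def start_firefox(extra_args=()):
    """Start Firefox without blocking; fail clearly if it is not installed."""
    if not FIREFOX_EXE.is_file():
        raise FileNotFoundError(f"Firefox not found at {FIREFOX_EXE}")
    return subprocess.Popen(firefox_command(extra_args))
```

Passing the arguments as a list rather than a single string avoids quoting problems with the space in "Program Files".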
