
Lt-Colonel-Kilgore-team web crawler

Project description

This project is a web crawler that scrapes .gov websites and stores the results in a PostgreSQL database. It was built as an assignment for a course on information retrieval and extraction; you can read more about the assignment at the following link. The project has two main components:

  1. Database setup directory ('DatabaseSetup'): contains a docker-compose file for setting up the PostgreSQL database.
  2. Web crawler script ('WebCrawlerService.py'): the Python script responsible for the web crawling process.

Features

  • Concurrent scraping using multithreading
  • Support for handling robots.txt and sitemap.xml files (see the sketch after this list)
  • Storage of data in a PostgreSQL database
  • Duplicate page detection and handling
  • SQLite database for managing crawl frontier and crawl delays
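
A minimal sketch of robots.txt handling using Python's standard urllib.robotparser (the example site, user-agent string, and fallback delay are illustrative assumptions, not details taken from WebCrawlerService.py):

    import urllib.robotparser

    def check_url(url: str, user_agent: str = "kurtz-crawler") -> tuple[bool, float]:
        # Download and parse robots.txt for the target site.
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("https://www.gov.si/robots.txt")
        rp.read()
        allowed = rp.can_fetch(user_agent, url)
        # crawl_delay() returns None when the site sets no delay; fall back to 5 seconds.
        delay = rp.crawl_delay(user_agent) or 5.0
        return allowed, delay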

Necessary programs

  • chromedriver
  • Docker (on Windows, install Docker Desktop)
  • Docker Compose
  • Postman
  • an IDE (e.g., PyCharm or VS Code)
  • Python (version 3 or later)

Optional programs

  • pgAdmin
  • DB Browser for SQLite

Set up the PostgreSQL database

In the DatabaseSetup directory, open a terminal (e.g., PowerShell on Windows) and run the following command:

docker compose up -d

This command will create the PostgreSQL database using the schema provided in the init-schema file.
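
Once the container is up, you can sanity-check the database from Python. A minimal sketch using psycopg2 (the database name and credentials below are placeholders; use the values defined in the DatabaseSetup compose file):

    import psycopg2

    # Placeholder credentials -- match these to the docker-compose configuration.
    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="crawldb",
        user="postgres",
        password="postgres",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone())
    conn.close()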

Starting the web crawler

  1. Open the project in your preferred IDE (e.g., PyCharm or VSCode).
  2. Install the required dependencies by running pip install -r requirements.txt.
  3. In the WebCrawlerService.py script, set the number of threads in the main function by adjusting the max_workers parameter (e.g., max_workers=10); see the thread-pool sketch at the end of this section.
  4. Run the program using the IDE's run button, or from the command line with flask --app WebCrawlerService --debug run.
  5. To start the web crawler, send a POST request to the /scrape endpoint using Postman (or from Python, as sketched below) with the following JSON body:
    {
      "messages": [
        "https://www.gov.si/",
        "https://evem.gov.si/",
        "https://e-uprava.gov.si/",
        "https://www.e-prostor.gov.si/"
      ]
    }
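
If you prefer not to use Postman, the same request can be sent from Python with the requests library (a sketch assuming the crawler is listening on Flask's default address, http://127.0.0.1:5000):

    import requests

    # Hypothetical example: adjust the URL list to the sites you want to crawl.
    resp = requests.post(
        "http://127.0.0.1:5000/scrape",
        json={"messages": ["https://www.gov.si/", "https://evem.gov.si/"]},
    )
    print(resp.status_code, resp.text)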

This will initiate the web scraping process for the specified websites, and the results will be stored in the PostgreSQL database.
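
For reference, the max_workers value from step 3 is the standard knob on concurrent.futures.ThreadPoolExecutor, the usual way to structure concurrent scraping in Python. A minimal sketch (the fetch_page worker and seed list are hypothetical, not taken from WebCrawlerService.py):

    from concurrent.futures import ThreadPoolExecutor

    import requests

    def fetch_page(url: str) -> int:
        # Hypothetical worker: fetch one page and return its HTTP status code.
        return requests.get(url, timeout=10).status_code

    seeds = ["https://www.gov.si/", "https://e-uprava.gov.si/"]
    with ThreadPoolExecutor(max_workers=10) as executor:
        for url, status in zip(seeds, executor.map(fetch_page, seeds)):
            print(url, status)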
