
Lt-Colonel-Kilgore-team web crawler

Project description

This project is a web crawler that scrapes .gov websites and stores the results in a PostgreSQL database. It was built as an assignment for a course on information retrieval and extraction; you can read more about the assignment at the following link. The project has two main components:

  1. Database setup directory ('DatabaseSetup'): contains a docker-compose file for setting up the PostgreSQL database.
  2. Web crawler script ('WebCrawlerService.py'): the Python script responsible for the web crawling process.

Features

  • Concurrent scraping using multithreading
  • Support for handling robots.txt and sitemap.xml files (see the sketch after this list)
  • Storage of data in a PostgreSQL database
  • Duplicate page detection and handling
  • SQLite database for managing crawl frontier and crawl delays
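
A minimal sketch of robots.txt handling using Python's standard urllib.robotparser (the example site, user-agent string, and fallback delay are illustrative assumptions, not details taken from WebCrawlerService.py):

    import urllib.robotparser

    def check_url(url: str, user_agent: str = "kurtz-crawler") -> tuple[bool, float]:
        # Download and parse robots.txt for the target site.
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("https://www.gov.si/robots.txt")
        rp.read()
        allowed = rp.can_fetch(user_agent, url)
        # crawl_delay() returns None when the site sets no delay; fall back to 5 seconds.
        delay = rp.crawl_delay(user_agent) or 5.0
        return allowed, delay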

Necessary programs

  • chromedriver
  • Docker (on Windows, install Docker Desktop)
  • Docker Compose
  • Postman
  • an IDE (e.g., PyCharm or VS Code)
  • Python (version 3 or later)

Optional programs

  • pgAdmin
  • DB Browser for SQLite

Set up the PostgreSQL database

In the DatabaseSetup directory, open a terminal (e.g., PowerShell on Windows) and run the following command:

docker compose up -d

This command will create the PostgreSQL database using the schema provided in the init-schema file.
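
Once the container is up, you can sanity-check the database from Python. A minimal sketch using psycopg2 (the database name and credentials below are placeholders; use the values defined in the DatabaseSetup compose file):

    import psycopg2

    # Placeholder credentials -- match these to the docker-compose configuration.
    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="crawldb",
        user="postgres",
        password="postgres",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone())
    conn.close()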

Starting the web crawler

  1. Open the project in your preferred IDE (e.g., PyCharm or VSCode).
  2. Install the required dependencies by running pip install -r requirements.txt.
  3. In the WebCrawlerService.py script, set the number of threads in the main function by adjusting the max_workers parameter (e.g., max_workers=10); see the thread-pool sketch at the end of this section.
  4. Run the program using the IDE's run button, or from the command line with flask --app WebCrawlerService --debug run.
  5. To start the web crawler, send a POST request to the /scrape endpoint using Postman (or from Python, as sketched below) with the following JSON body:
    {
      "messages": [
        "https://www.gov.si/",
        "https://evem.gov.si/",
        "https://e-uprava.gov.si/",
        "https://www.e-prostor.gov.si/"
      ]
    }
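
If you prefer not to use Postman, the same request can be sent from Python with the requests library (a sketch assuming the crawler is listening on Flask's default address, http://127.0.0.1:5000):

    import requests

    # Hypothetical example: adjust the URL list to the sites you want to crawl.
    resp = requests.post(
        "http://127.0.0.1:5000/scrape",
        json={"messages": ["https://www.gov.si/", "https://evem.gov.si/"]},
    )
    print(resp.status_code, resp.text)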

This will initiate the web scraping process for the specified websites, and the results will be stored in the PostgreSQL database.
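
For reference, the max_workers value from step 3 is the standard knob on concurrent.futures.ThreadPoolExecutor, the usual way to structure concurrent scraping in Python. A minimal sketch (the fetch_page worker and seed list are hypothetical, not taken from WebCrawlerService.py):

    from concurrent.futures import ThreadPoolExecutor

    import requests

    def fetch_page(url: str) -> int:
        # Hypothetical worker: fetch one page and return its HTTP status code.
        return requests.get(url, timeout=10).status_code

    seeds = ["https://www.gov.si/", "https://e-uprava.gov.si/"]
    with ThreadPoolExecutor(max_workers=10) as executor:
        for url, status in zip(seeds, executor.map(fetch_page, seeds)):
            print(url, status)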
