
vfedotovs / sslv_web_scraper

5 stars · 1 watcher · 3 forks · 642 KB

A web scraping app for ss.lv that automates scraping and filtering of classified ads, emails the results, and stores the scraped data in a database.

License: GNU General Public License v3.0

Python 83.82% Dockerfile 0.57% Makefile 2.61% Shell 12.99%
python webscraping email-sender docker fpdf-library pandas-library sendgrid-api beautifulsoup4 requests email

sslv_web_scraper's Introduction

SS.LV Web Scraper

Badges: Test, Build and Deploy CI · codecov coverage
Explore historical data on Ogre city apartments for sale at http://propertydata.lv/

About the application:

Purpose: This application scrapes the ss.lv apartments-for-sale category daily for a specific city of your choice, stores the scraped data in a Postgres database, and sends a daily email report.

Requirements

# docker -v                                                                 
Docker version 20.10.11, build dea9396

# docker-compose -v                                                                  
Docker Compose version v2.2.1

How to use the application:

  1. Clone the repository
  2. Create database.ini; an example is shown below (see also the configparser sketch after this list):
[postgresql]
host=<your docker db hostname>
database=<your db name>
user=<your db username>
password=<your db password>
  3. Create a .env.prod file for docker compose:
# ws_worker container envs
[email protected]
SENDGRID_API_KEY=<Your SENDGRID API Key>
[email protected]
POSTGRES_PASSWORD=<Your DB Password>
  4. Run docker-compose --env-file .env.prod up -d
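
A minimal sketch of how the application could load database.ini, assuming it uses the standard library configparser (the helper name read_db_config is illustrative, not necessarily the project's actual function):

from configparser import ConfigParser


def read_db_config(filename: str = "database.ini", section: str = "postgresql") -> dict:
    """Read the [postgresql] section of database.ini into a dict of connection settings."""
    parser = ConfigParser()
    parser.read(filename)
    if not parser.has_section(section):
        raise KeyError(f"Section [{section}] not found in {filename}")
    return dict(parser.items(section))


# The resulting dict (host, database, user, password) can be passed straight
# to psycopg2.connect(**read_db_config()).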

Use make

make                                                                          
help                 💬 This help message
all                  runs setup, build and up targets
setup                gets database.ini and .env.prod and downloads the latest DB backup file
build                builds all containers
up                   starts all containers
down                 stops all containers
clean                removes setup and DB files and folders
lt                   Lists table sizes in the postgres container to verify the DB dump was restored correctly

Currently available features

  • Scrape the ss.lv website to extract advert data from the Ogre city apartments-for-sale section
  • Store scraped data in the postgres database container (tables listed_ads and removed_ads) to track longer-term price trends
  • Send a daily email that includes advert URLs and key data categorized by room count
  • Attach a PDF report to the email with basic price analytics categorized by room count
  • Fully automated deployment of the dev branch to AWS EC2 with GitHub Actions CI/CD
  • Tests and a test coverage step in CI/CD and in README.md
  • Web service functionality for data exploration using Pygwalker and Streamlit

Work in progress:

  • Add the Streamlit web service to CI/CD
  • Add docs and a doc coverage step in CI/CD and in README.md

sslv_web_scraper's People

Contributors

vfedotovs


sslv_web_scraper's Issues

FEAT: Add Makefile

Create a Makefile for shortcuts like:
-- docker compose up
-- docker clean unused images
-- pytest
-- flake8
-- docker build and push
-- deploy to AWS

BUG: container keeps crashing if the inserted ad's listed time is less than 24 hours

File "", line 219, in _call_with_frames_removed
File "/./app/main.py", line 7, in
from app.wsmodules.db_worker import db_worker_main
File "/./app/wsmodules/db_worker.py", line 547, in
db_worker_main()
File "/./app/wsmodules/db_worker.py", line 75, in db_worker_main
update_dlv_in_db_table(to_increment_msg_data, todays_date)
File "/./app/wsmodules/db_worker.py", line 459, in update_dlv_in_db_table
if int(correct_dlv) > days_listed:
ValueError: invalid literal for int() with base 10: '19:42:31.992576'
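
The int() call fails because ss.lv reports a time string such as '19:42:31.992576' instead of a day count when an ad has been listed for less than 24 hours. A minimal guard sketch, assuming such values can be treated as 0 days (helper name is illustrative):

def parse_days_listed(raw_value: str) -> int:
    """Convert the scraped 'days listed' value to an int; ads younger than
    24 hours come through as a time string like '19:42:31.992576'."""
    if ":" in raw_value:  # time string -> ad has been listed for less than one day
        return 0
    return int(raw_value)


# Hypothetical usage inside update_dlv_in_db_table:
# if parse_days_listed(correct_dlv) > days_listed: ...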

Implement logging for db_worker module

Requirements:

  • should create a log file named db_worker.log
  • log lines include:
    • timestamp
    • source module, function and line number, message
  • should rotate logs: max size 1 MB, keep 5 history files
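
A sketch of a logger setup that would satisfy these requirements with the standard library; the format string mirrors the log lines shown in other issues, while the exact module layout is an assumption:

import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("db_worker")
logger.setLevel(logging.INFO)

# Rotate at 1 MB and keep 5 history files: db_worker.log.1 ... db_worker.log.5
handler = RotatingFileHandler("db_worker.log", maxBytes=1_000_000, backupCount=5)
handler.setFormatter(logging.Formatter(
    "%(asctime)s: %(name)s: %(levelname)s: %(funcName)s: %(lineno)d: %(message)s"))
logger.addHandler(handler)

logger.info("--- Starting db_worker module ---")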

Implement 3 features for MVP

  • Implement basic web scraper functionality
  • Implement data formatting feature for basic text email
  • Implement send formatted text data as daily email functionality

Restore daily email functionality with report in attachment

Goal: the daily mail must be sent via the SendGrid email API

Subgoals:

  1. The text report should be included in the email (limited to single and double room apartments)
  2. The analytics module should run and create a PDF report
  3. The email should include the PDF report as an attachment and the text (no limit to single or double room apartments)
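
A hedged sketch of sending such an email with the sendgrid Python library; the environment variable names and the attachment file name are assumptions taken from the .env.prod example and the logs elsewhere on this page:

import base64
import os

from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import (Attachment, Disposition, FileContent,
                                   FileName, FileType, Mail)


def send_daily_report(text_report: str, pdf_path: str = "Ogre_city_report.pdf") -> None:
    """Send the daily text report via the SendGrid API with the PDF report attached."""
    message = Mail(from_email=os.environ["SRC_EMAIL"],    # assumed env var names
                   to_emails=os.environ["DEST_EMAIL"],
                   subject="Daily Ogre apartments-for-sale report",
                   plain_text_content=text_report)

    with open(pdf_path, "rb") as pdf:
        encoded = base64.b64encode(pdf.read()).decode()
    message.attachment = Attachment(FileContent(encoded),
                                    FileName(pdf_path),
                                    FileType("application/pdf"),
                                    Disposition("attachment"))

    SendGridAPIClient(os.environ["SENDGRID_API_KEY"]).send(message)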

Reduce log event count in the task scheduler by 10x

2022-01-16 21:18:12,243 [MainThread ] [INFO ] : : 49: ts_loop: checking every 300 sec if cheduled task needs to run again...
2022-01-16 21:23:12,327 [MainThread ] [INFO ] : : 49: ts_loop: checking every 300 sec if cheduled task needs to run again...
2022-01-16 21:28:12,425 [MainThread ] [INFO ] : : 49: ts_loop: checking every 300 sec if cheduled task needs to run again...

From once every 5 min (300 sec) to once every 50 min (3000 sec)
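
A minimal sketch of the interval change in the scheduler loop; the actual task-check call is left as a comment because its real name is not shown here:

import logging
import time

CHECK_INTERVAL_SEC = 3000  # was 300: one "checking..." log line every 50 min instead of every 5 min


def ts_loop() -> None:
    while True:
        logging.info("ts_loop: checking every %d sec if scheduled task needs to run again...",
                     CHECK_INTERVAL_SEC)
        # ...existing check that runs the scheduled scrape task when due...
        time.sleep(CHECK_INTERVAL_SEC)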

Fix bug FileNotFoundError: [Errno 2] No such file or directory: 'cleaned-sorted-df.csv'

./db_worker.py
DEBUG: Satrting db_worker module ...
DEBUG: Loaded cleaned-sorted-df.csv to dataframe in memory ...
Traceback (most recent call last):
File "./db_worker.py", line 145, in <module>
db_worker_main()
File "./db_worker.py", line 36, in db_worker_main
df_hashes = get_data_frame_hashes('cleaned-sorted-df.csv')
File "./db_worker.py", line 54, in get_data_frame_hashes
df = pd.read_csv(df_filename)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 454, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 948, in __init__
self._make_engine(self.engine)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 2010, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] No such file or directory: 'cleaned-sorted-df.csv'
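
A guard sketch that fails with a clearer message when the scraper stage has not produced the CSV yet (the helper name is illustrative):

from pathlib import Path

import pandas as pd


def load_cleaned_df(csv_path: str = "cleaned-sorted-df.csv") -> pd.DataFrame:
    """Load the scraper output, or explain which stage is missing if the file is absent."""
    if not Path(csv_path).exists():
        raise FileNotFoundError(
            f"{csv_path} not found: run the scraper/formatter stages before db_worker")
    return pd.read_csv(csv_path)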

Refactor duplicate function call in db_worker.py

2021-07-23 17:13:28,419: db_worker: INFO: db_worker_main: 45: --- Satrting db_worker module ---
2021-07-23 17:13:28,421: db_worker: INFO: get_data_frame_hashes: 93: Extracted 35 hashes from pandas data frame
2021-07-23 17:13:28,454: db_worker: INFO: clean_db_hashes: 137: Extracted and cleaned 35 hashes from listed_ads table
2021-07-23 17:13:28,454: db_worker: INFO: categorize_hashes: 148: Categorizing hashes based on listed_ads table hashes and and new df hashes
2021-07-23 17:13:28,456: db_worker: INFO: get_data_frame_hashes: 93: Extracted 35 hashes from pandas data frame
2021-07-23 17:13:28,467: db_worker: INFO: clean_db_hashes: 137: Extracted and cleaned 35 hashes from listed_ads table
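
The log shows get_data_frame_hashes and clean_db_hashes each running twice per cycle. A sketch of the refactor: compute both hash collections once in db_worker_main and pass them down (the function names come from the log above; the exact signatures are assumptions):

def db_worker_main() -> None:
    # Build both hash collections exactly once and hand them to categorize_hashes,
    # instead of letting it re-read the CSV and the listed_ads table itself.
    df_hashes = get_data_frame_hashes("cleaned-sorted-df.csv")
    db_hashes = clean_db_hashes()
    new, still_listed, to_remove = categorize_hashes(df_hashes, db_hashes)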

Fix incorrect handling when inserting new messages that are less than 1 day old

Traceback (most recent call last):
File "./app.py", line 18, in <module>
import db_worker
File "/home/ec2-user/sslv_web_scraper/sslv_web_scraper/db_worker.py", line 549, in <module>
db_worker_main()
File "/home/ec2-user/sslv_web_scraper/sslv_web_scraper/db_worker.py", line 74, in db_worker_main
update_dlv_in_db_table(to_increment_msg_data, todays_date)
File "/home/ec2-user/sslv_web_scraper/sslv_web_scraper/db_worker.py", line 461, in update_dlv_in_db_table
if int(correct_dlv) > days_listed:
ValueError: invalid literal for int() with base 10: '21:53:32.177067'

Categorize hashes in 3 categories

Categorize ad hashes into 3 categories:

  • new hashes (for insert to listed_ads table)
  • seen hashes but not delisted yet (increment listed days value)
  • delisted hashes (for insert to delisted_ads and remove from listed_ads table)
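
A sketch of the categorization with plain set operations, assuming both hash collections are available as sets (the signature of categorize_hashes is an assumption):

def categorize_hashes(df_hashes: set, db_hashes: set) -> tuple:
    """Split ad hashes into new / still-listed / delisted groups."""
    new_hashes = df_hashes - db_hashes     # insert into listed_ads
    still_listed = df_hashes & db_hashes   # increment the days-listed value
    delisted = db_hashes - df_hashes       # move from listed_ads to removed_ads
    return new_hashes, still_listed, delisted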

FEAT: Implement logging for formater, cleaner, file_remover

INFO: Started server process [1]

From ws DOCKER console log
Debug info: Started dat_formater module ... <<< need to improve
Debug info: Ended data_formater module ...
Debug info: Starting data frame cleaning module ...
Debug info: Completed dat_formater module ...
Error: 1_rooms_tmp.txt : No such file or directory
Error: Mailer_report.txt : No such file or directory
Error: basic_price_stats.txt : No such file or directory
Error: 1-4_rooms.png : No such file or directory
Error: 1_rooms.png : No such file or directory
Error: 2_rooms.png : No such file or directory
Error: test.png : No such file or directory
Error: mrv2.txt : No such file or directory
Error: Ogre_city_report.pdf : No such file or directory
2022-01-06 16:16:26,071 [MainThread ] [INFO ] : serve: 84: Started server process [1]
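
Most of the errors above appear to come from cleaning up per-run files that do not exist yet. A sketch of a file_remover-style cleanup that logs a warning instead of erroring; the file list is copied from the console output, while the module and logger names are illustrative:

import logging
from pathlib import Path

logger = logging.getLogger("file_remover")

CLEANUP_FILES = ["1_rooms_tmp.txt", "Mailer_report.txt", "basic_price_stats.txt",
                 "1-4_rooms.png", "1_rooms.png", "2_rooms.png",
                 "test.png", "mrv2.txt", "Ogre_city_report.pdf"]


def remove_generated_files() -> None:
    """Delete per-run artifacts, logging (not crashing) when a file is absent."""
    for name in CLEANUP_FILES:
        path = Path(name)
        if path.exists():
            path.unlink()
            logger.info("Removed %s", name)
        else:
            logger.warning("%s not found, nothing to remove", name)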

BUG: ERROR: Exception in ASGI application in web_scraper container

INFO: 192.168.176.1:57528 - "GET / HTTP/1.1" 200 OK
INFO: 192.168.176.1:57528 - "GET /favicon.ico HTTP/1.1" 404 Not Found
DEBUG: sleeping 90 sec ... waiting for srape task to complete
DEBUG: sleeping 5 sec .. waiting for dataformater task to complete
DEBUG: sleeping 3 sec
DEBUG: sleeping 5 sec
INFO: 192.168.176.4:43588 - "GET /run-task/ogre HTTP/1.1" 200 OK
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/http/h11_impl.py", line 373, in run_asgi
result = await app(self.scope, self.receive, self.send)
File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/fastapi/applications.py", line 208, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/applications.py", line 112, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
await self.app(scope, receive, sender)
File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 656, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 259, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 64, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/responses.py", line 142, in __call__
await self.background()
File "/usr/local/lib/python3.8/site-packages/starlette/background.py", line 35, in __call__
await task()
File "/usr/local/lib/python3.8/site-packages/starlette/background.py", line 20, in __call__
await run_in_threadpool(self.func, *self.args, **self.kwargs)
File "/usr/local/lib/python3.8/site-packages/starlette/concurrency.py", line 39, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/usr/local/lib/python3.8/site-packages/anyio/to_thread.py", line 28, in run_sync
return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 754, in run
result = context.run(func, *args)
File "/./app/wsmodules/web_scraper.py", line 57, in scrape_website
exit(0)
File "/usr/local/lib/python3.8/_sitebuiltins.py", line 26, in __call__
raise SystemExit(code)
SystemExit: 0
Debug info: Starting website parsing module ...
Checking if job: scrape OGRE apartments has run today
Job did run today state: True
--- Finished ws_worker module because job was run today state: true ---
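
The SystemExit comes from calling exit(0) inside scrape_website, which runs as a Starlette background task, so the interpreter-exit escapes into the ASGI stack. A sketch of the fix: return from the task instead of exiting (the guard function name is illustrative, mirroring the log above):

import logging


def scrape_website(city: str) -> None:
    """Background task for the /run-task endpoint."""
    if job_has_run_today(city):  # illustrative guard name
        logging.info("Finished ws_worker module because job was run today")
        return                   # previously exit(0), which raised SystemExit in the ASGI stack
    # ... actual scraping logic ...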

Add debug logging of all new messages in dbworker.log

db_worker: INFO: compare_df_to_db_hashes: 164: Result 15 new, 29 still_listed, 1 to_remove hashes
need to log exactly which 15 ads were new

It seems suspicious that there were 15 new ads, but I don't have 15 with the same listed date; they may have been edited without being listed on the same day

This could be triaged by comparing two days of database backups
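
A small sketch of the requested logging, writing every new hash to dbworker.log right after categorization (the names are assumptions):

import logging


def log_new_hashes(new_hashes: set, logger: logging.Logger) -> None:
    """Record exactly which ad hashes are about to be inserted into listed_ads."""
    for ad_hash in sorted(new_hashes):
        logger.debug("New ad hash queued for insert into listed_ads: %s", ad_hash)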

Fix issue: table insert fails for all rows if the listed-days value is below 1 day (a time in hours)

2021-07-25 23:17:30,881: db_worker: INFO: insert_data_to_listed_table: 236: 15
2021-07-25 23:17:30,881: db_worker: INFO: insert_data_to_listed_table: 236: 19
2021-07-25 23:17:30,882: db_worker: INFO: insert_data_to_listed_table: 236: 23:17:30.748172
2021-07-25 23:17:30,882: db_worker: ERROR: insert_data_to_listed_table: 262: invalid input syntax for integer: "23:17:30.748172"
LINE 12: ...00, 35.0, 971.43, 'Skolas iela 1b', '2021.07.25', '23:17:30....
