satrancci / web-crawler-with-proxies
A web crawler for Vrbo.com that uses Crawlera API and Hotspot Shield VPN as proxy services.
After #21, we now have a Selenium crawler that gets valid room IDs, which are stored in the /routes_to_crawl directory. We will pass those room IDs into the curl command.
Before embarking on a fully automated crawling process using proxies, it makes sense to test the program on hard-coded values. These ten hard-coded proxies will be retrieved manually from https://sslproxies.org/. In the future, the goal is to create a separate crawler (or use an existing library) to retrieve valid proxies automatically from https://sslproxies.org/.
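Until the automatic proxy crawler exists, the hard-coded list can simply be cycled through, one proxy per request. A minimal sketch, assuming requests-style proxy dicts; the addresses below are TEST-NET placeholders, not real entries from sslproxies.org:

```python
import itertools

# Placeholder entries standing in for the ten proxies that would be
# copied by hand from https://sslproxies.org/ (illustrative only).
HARDCODED_PROXIES = [
    "203.0.113.10:3128",
    "203.0.113.11:8080",
    "203.0.113.12:3128",
]

def proxy_cycle(proxies):
    """Yield requests-style proxy dicts, cycling over the list forever."""
    for addr in itertools.cycle(proxies):
        yield {"http": f"http://{addr}", "https": f"http://{addr}"}
```

Each crawl step would pull the next dict from the generator and pass it to the HTTP client.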
Add a parser that iterates over the crawled_routes directory, parses the HTML files, and writes one CITY, PRICE pair per row to a txt file.
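A minimal sketch of such a parser. The extraction patterns (a "$123"-style price and a <title> beginning with the city) are illustrative guesses, not Vrbo's actual markup, so the real selectors would need to be swapped in:

```python
import pathlib
import re

def extract_city_price(html):
    """Pull (city, price) out of one crawled page, or None.

    The regexes are illustrative guesses -- a price like "$123" and a
    <title> starting with the city -- not Vrbo's actual markup.
    """
    price = re.search(r"\$(\d+)", html)
    city = re.search(r"<title>([^<,]+)", html)
    if price and city:
        return city.group(1).strip(), int(price.group(1))
    return None

def parse_directory(src="crawled_routes", dst="parsed_data.txt"):
    """Walk the crawled pages and write one CITY,PRICE row per file."""
    with open(dst, "w") as out:
        for page in pathlib.Path(src).glob("*.html"):
            row = extract_city_price(page.read_text())
            if row:
                out.write(f"{row[0]},{row[1]}\n")
```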
crawler_hotspot.py needs to be improved. Current problems: hotspot_connect_random() returns successfully even when in a disconnected state. More thorough error checking needs to be done. Also, it needs to be split into separate functions for disconnect and connect.
crawl_with_hotspot_shield()
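One way to sketch the split into separate connect/disconnect functions, verifying the state via `status` output instead of trusting the exit code. The `hotspotshield` subcommands (connect, disconnect, status) are assumed to match the installed Linux CLI and should be checked against it:

```python
import subprocess

def _hotspot(*args):
    """Run the Hotspot Shield CLI and return its combined output.

    The subcommands used below (connect/disconnect/status) are assumed
    to match the installed Linux CLI and should be verified against it.
    """
    result = subprocess.run(["hotspotshield", *args],
                            capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr

def is_connected(status_output):
    """Decide connectivity from `status` output (pure, testable)."""
    text = status_output.lower()
    return "connected" in text and "disconnected" not in text

def hotspot_disconnect():
    _hotspot("disconnect")

def hotspot_connect(location="random"):
    """Connect, then verify via `status` instead of trusting exit codes."""
    _hotspot("connect", location)
    if not is_connected(_hotspot("status")):
        raise RuntimeError("Hotspot Shield still reports a disconnected state")
```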
Plot the price data retrieved in #2.
Requirements: the script must iterate over parsed_data.txt, which has the format CITY, PRICE per row, and plot a CDF: i) for each city; ii) for multiple cities in one plot, where the user can pass a txt file from the command line with the list of cities to appear on the same plot.
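A sketch of the plotting side, with the empirical-CDF computation kept as a pure helper and matplotlib imported lazily so the data helpers carry no plotting dependency. Function names are assumptions, not the repo's existing API:

```python
def load_prices(path="parsed_data.txt"):
    """Read CITY,PRICE rows into a {city: [prices]} dict."""
    prices = {}
    with open(path) as f:
        for line in f:
            city, price = line.strip().split(",")
            prices.setdefault(city, []).append(float(price))
    return prices

def cdf_points(values):
    """Empirical CDF as (x, F(x)) pairs (pure, easy to test)."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def plot_cdfs(prices, cities, out="cdf.png"):
    """Overlay one CDF step curve per requested city."""
    import matplotlib                # imported lazily so the helpers
    matplotlib.use("Agg")            # above stay dependency-free
    import matplotlib.pyplot as plt
    for city in cities:
        pts = cdf_points(prices[city])
        plt.step([x for x, _ in pts], [y for _, y in pts], label=city)
    plt.xlabel("price")
    plt.ylabel("CDF")
    plt.legend()
    plt.savefig(out)
```

For requirement ii), the city list from the user's txt file would simply be passed as the `cities` argument.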
Use functions from proxies_crawler.py to fetch and parse free proxies from https://sslproxies.org/ in lieu of hard-coded proxies.
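A hedged sketch of what such fetch-and-parse helpers might look like (not the actual proxies_crawler.py code). The regex assumes sslproxies.org renders IP and port in adjacent <td> cells, which held at the time of writing but may change:

```python
import re
import urllib.request

# IP in one <td>, port in the next -- an assumption about the page markup.
PROXY_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})</td><td>(\d{2,5})")

def parse_proxies(html):
    """Extract ip:port strings from the sslproxies.org table markup."""
    return [f"{ip}:{port}" for ip, port in PROXY_RE.findall(html)]

def fetch_proxies(url="https://sslproxies.org/"):
    """Download the page and return the current free-proxy list."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_proxies(resp.read().decode())
```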
Now that we have basic functionality, it is time to start putting everything (crawling, parsing and visualizing) together. This issue aims to build a basic pipeline that automatically crawls, parses and visualizes 10 Vrbo rooms.
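The pipeline's shape can be sketched with the three stages passed in as callables, so the sketch stays independent of the concrete crawler, parser, and plotting code that the other issues describe:

```python
def run_pipeline(room_ids, crawl, parse, visualize):
    """Crawl -> parse -> visualize for a small batch of rooms.

    `crawl`, `parse` and `visualize` stand in for the real crawling,
    parsing and plotting code elsewhere in the repo.
    """
    pages = [crawl(rid) for rid in room_ids]            # fetch raw HTML
    rows = [r for r in (parse(p) for p in pages) if r]  # keep valid rooms
    visualize(rows)                                     # e.g. CDF plot
    return rows
```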
To make sure that crawling with curl is not blind, we need to get valid room IDs from somewhere. A good tool might be Selenium.
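Whatever Selenium collects, the ID extraction itself can be a pure function over the gathered links. The assumption that Vrbo room URLs carry a long numeric ID segment is an illustrative guess at the URL scheme; Selenium would supply `hrefs` by reading the href attribute of each result card on a search page:

```python
import re

# Assumed URL shape: a path segment of 6+ digits identifies a room.
ROOM_ID_RE = re.compile(r"/(\d{6,})(?:[/?]|$)")

def room_ids_from_links(hrefs):
    """Extract numeric room IDs from search-result links."""
    ids = []
    for href in hrefs:
        m = ROOM_ID_RE.search(href)
        if m:
            ids.append(m.group(1))
    return ids
```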
Hi Alex, your work is excellent. I am a doctoral researcher at the University of Helsinki (https://researchportal.helsinki.fi/en/persons/thanh-tung-vuong). My research requires an implementation of a simple search system. For instance, the system would crawl and index content from some health and medical sites for a conversational agent. For this purpose, would you be so kind as to share the code?
Add a service to crawl https://sslproxies.org/ every n minutes and return a list of proxies.
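A minimal scheduling sketch for such a service, using a re-arming daemon timer; `fetch` is any zero-argument callable returning a proxy list (for example one built on proxies_crawler.py), and the function name is an assumption:

```python
import threading

def start_proxy_refresher(fetch, interval_minutes=10):
    """Call `fetch` now and then every n minutes, caching the result.

    Callers read the most recent proxy list from the returned dict.
    """
    cache = {"proxies": []}

    def refresh():
        cache["proxies"] = fetch()
        timer = threading.Timer(interval_minutes * 60, refresh)
        timer.daemon = True   # do not keep the process alive
        timer.start()

    refresh()
    return cache
```

In a long-running crawl, the crawler would consult `cache["proxies"]` before each request instead of a fixed list.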
Design a curl command that crawls and stores data for 1 room from Vrbo.com.
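A sketch of driving that curl command from Python, with the command assembly kept pure so it is easy to inspect. The URL pattern is a guess at Vrbo's room-page scheme, and the real command would also carry the proxy and cookie options discussed in the other issues:

```python
import subprocess

def build_curl_cmd(room_id, out_dir="crawled_routes"):
    """Assemble the curl invocation for one room page (pure)."""
    url = f"https://www.vrbo.com/{room_id}"   # assumed URL scheme
    return ["curl", "-sL", "-o", f"{out_dir}/{room_id}.html", url]

def crawl_room(room_id):
    """Fetch one room page to disk; returns curl's exit code."""
    return subprocess.run(build_curl_cmd(room_id)).returncode
```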
We need to add cookies in order to minimize the risk of being blocked by the website that we crawl.
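With curl this comes down to a cookie jar: `-c` writes cookies received by one request and `-b` replays them on the next, so successive crawls look like one browsing session. A small helper sketch (the function name is an assumption):

```python
def curl_with_cookies(url, cookie_jar="cookies.txt"):
    """Build a curl invocation that stores (-c) and replays (-b) cookies."""
    return ["curl", "-sL", "-b", cookie_jar, "-c", cookie_jar, url]
```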
Instead of passing a single city as an argument to selenium_crawler.py, modify it to take in a txt file with the list of locations from the command line.
After #25, selenium_crawler.py can now take in a txt file (e.g. locations_to_crawl.txt) with the list of locations to crawl. To reduce the chance of being blocked by the website, we must not only sleep but also use proxies (like in run.py). One idea is to interact with the Hotspot Shield CLI using the subprocess module, switching to a different proxy after crawling the data for one location (e.g. 'barcelona') from locations_to_crawl.txt.
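The rotation pattern can be sketched with the VPN and crawl steps passed in as callables; `connect`/`disconnect` stand in for subprocess wrappers around the Hotspot Shield CLI and `crawl_one` for the per-location Selenium crawl:

```python
def read_locations(text):
    """Parse locations_to_crawl.txt content, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def crawl_locations(text, crawl_one, connect, disconnect):
    """Crawl each location behind a fresh VPN exit."""
    for loc in read_locations(text):
        connect()            # new VPN exit before each location
        try:
            crawl_one(loc)
        finally:
            disconnect()     # always drop the tunnel afterwards
```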
requirements.txt
.sh script that will create empty directories, create a virtualenv and .env, install dependencies, etc.
Since Crawlera's free trial provides only 10k requests and there are millions of pages to crawl from Vrbo.com, we need some other proxy service. https://www.hotspotshield.com/vpn/vpn-for-linux may be exactly what we need.
Parse the data from #1 and retrieve the price for that room.
Currently, parsing begins only after the crawling step is done. However, if we crawl hundreds of thousands (or millions) of pages and store them to disk, we may run out of disk space before we even complete the crawling procedure. Thus, I need to modify crawler.py so that it crawls and then immediately parses each page, and then writes to disk only information about successful pages (e.g. those that are valid rooms), rather than every raw HTML object, as it does now.
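The crawl-then-parse-immediately loop can be sketched as below; `fetch(room_id)` returns raw HTML and `parse(html)` returns a (city, price) tuple or None, both standing in for the real crawler.py and parser code. Raw HTML is never written to disk, so disk usage stays proportional to the number of valid rooms, not pages crawled:

```python
def crawl_and_parse(room_ids, fetch, parse, out="parsed_data.txt"):
    """Parse each page right after fetching it; keep only valid rows."""
    kept = 0
    with open(out, "w") as f:
        for rid in room_ids:
            row = parse(fetch(rid))
            if row is not None:          # a valid room page
                f.write(f"{row[0]},{row[1]}\n")
                kept += 1
    return kept
```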