satrancci / web-crawler-with-proxies
A web crawler for Vrbo.com that uses Crawlera API and Hotspot Shield VPN as proxy services.
After #21, we now have a Selenium crawler that gets valid room IDs, which are stored in the /routes_to_crawl directory. We will pass those room IDs into the curl command.
Before embarking on a fully automated crawling process using proxies, it makes sense to test the program on hard-coded values. These ten hard-coded proxies will be retrieved manually from https://sslproxies.org/. In the future, the goal is to create a separate crawler (or use an existing library) to retrieve valid proxies automatically from https://sslproxies.org/.
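Until the automatic proxy crawler exists, the hard-coded list can simply be cycled through, one proxy per request. A minimal sketch, assuming requests-style proxy dicts; the addresses below are TEST-NET placeholders, not real entries from sslproxies.org:

```python
import itertools

# Placeholder entries standing in for the ten proxies that would be
# copied by hand from https://sslproxies.org/ (illustrative only).
HARDCODED_PROXIES = [
    "203.0.113.10:3128",
    "203.0.113.11:8080",
    "203.0.113.12:3128",
]

def proxy_cycle(proxies):
    """Yield requests-style proxy dicts, cycling over the list forever."""
    for addr in itertools.cycle(proxies):
        yield {"http": f"http://{addr}", "https": f"http://{addr}"}
```

Each crawl step would pull the next dict from the generator and pass it to the HTTP client.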
Add a parser that iterates over the crawled_routes directory, parses the HTML files, and writes one CITY, PRICE pair per row to a txt file.
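A minimal sketch of such a parser. The extraction patterns (a "$123"-style price and a <title> beginning with the city) are illustrative guesses, not Vrbo's actual markup, so the real selectors would need to be swapped in:

```python
import pathlib
import re

def extract_city_price(html):
    """Pull (city, price) out of one crawled page, or None.

    The regexes are illustrative guesses -- a price like "$123" and a
    <title> starting with the city -- not Vrbo's actual markup.
    """
    price = re.search(r"\$(\d+)", html)
    city = re.search(r"<title>([^<,]+)", html)
    if price and city:
        return city.group(1).strip(), int(price.group(1))
    return None

def parse_directory(src="crawled_routes", dst="parsed_data.txt"):
    """Walk the crawled pages and write one CITY,PRICE row per file."""
    with open(dst, "w") as out:
        for page in pathlib.Path(src).glob("*.html"):
            row = extract_city_price(page.read_text())
            if row:
                out.write(f"{row[0]},{row[1]}\n")
```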
crawler_hotspot.py needs to be improved. Current problems: hotspot_connect_random() returns successfully even when in a disconnected state. More thorough error checking needs to be done. Also, it needs to be split into separate functions for disconnect and connect.
crawl_with_hotspot_shield()
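One way to sketch the split into separate connect/disconnect functions, verifying the state via `status` output instead of trusting the exit code. The `hotspotshield` subcommands (connect, disconnect, status) are assumed to match the installed Linux CLI and should be checked against it:

```python
import subprocess

def _hotspot(*args):
    """Run the Hotspot Shield CLI and return its combined output.

    The subcommands used below (connect/disconnect/status) are assumed
    to match the installed Linux CLI and should be verified against it.
    """
    result = subprocess.run(["hotspotshield", *args],
                            capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr

def is_connected(status_output):
    """Decide connectivity from `status` output (pure, testable)."""
    text = status_output.lower()
    return "connected" in text and "disconnected" not in text

def hotspot_disconnect():
    _hotspot("disconnect")

def hotspot_connect(location="random"):
    """Connect, then verify via `status` instead of trusting exit codes."""
    _hotspot("connect", location)
    if not is_connected(_hotspot("status")):
        raise RuntimeError("Hotspot Shield still reports a disconnected state")
```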
Plot the price data retrieved in #2.
Requirements: the script must iterate over parsed_data.txt, which has the format CITY, PRICE per row, and plot a CDF: i) for each city; ii) for multiple cities in one plot, where the user can pass a txt file from the command line with the list of cities to appear on the same plot.
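A sketch of the plotting side, with the empirical-CDF computation kept as a pure helper and matplotlib imported lazily so the data helpers carry no plotting dependency. Function names are assumptions, not the repo's existing API:

```python
def load_prices(path="parsed_data.txt"):
    """Read CITY,PRICE rows into a {city: [prices]} dict."""
    prices = {}
    with open(path) as f:
        for line in f:
            city, price = line.strip().split(",")
            prices.setdefault(city, []).append(float(price))
    return prices

def cdf_points(values):
    """Empirical CDF as (x, F(x)) pairs (pure, easy to test)."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def plot_cdfs(prices, cities, out="cdf.png"):
    """Overlay one CDF step curve per requested city."""
    import matplotlib                # imported lazily so the helpers
    matplotlib.use("Agg")            # above stay dependency-free
    import matplotlib.pyplot as plt
    for city in cities:
        pts = cdf_points(prices[city])
        plt.step([x for x, _ in pts], [y for _, y in pts], label=city)
    plt.xlabel("price")
    plt.ylabel("CDF")
    plt.legend()
    plt.savefig(out)
```

For requirement ii), the city list from the user's txt file would simply be passed as the `cities` argument.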
Use functions from proxies_crawler.py to fetch and parse free proxies from https://sslproxies.org/ in lieu of hard-coded proxies.
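A hedged sketch of what such fetch-and-parse helpers might look like (not the actual proxies_crawler.py code). The regex assumes sslproxies.org renders IP and port in adjacent <td> cells, which held at the time of writing but may change:

```python
import re
import urllib.request

# IP in one <td>, port in the next -- an assumption about the page markup.
PROXY_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})</td><td>(\d{2,5})")

def parse_proxies(html):
    """Extract ip:port strings from the sslproxies.org table markup."""
    return [f"{ip}:{port}" for ip, port in PROXY_RE.findall(html)]

def fetch_proxies(url="https://sslproxies.org/"):
    """Download the page and return the current free-proxy list."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_proxies(resp.read().decode())
```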
Now that we have basic functionality, it is time to start putting everything (crawling, parsing and visualizing) together. This issue aims to build a basic pipeline that automatically crawls, parses and visualizes 10 Vrbo rooms.
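The pipeline's shape can be sketched with the three stages passed in as callables, so the sketch stays independent of the concrete crawler, parser, and plotting code that the other issues describe:

```python
def run_pipeline(room_ids, crawl, parse, visualize):
    """Crawl -> parse -> visualize for a small batch of rooms.

    `crawl`, `parse` and `visualize` stand in for the real crawling,
    parsing and plotting code elsewhere in the repo.
    """
    pages = [crawl(rid) for rid in room_ids]            # fetch raw HTML
    rows = [r for r in (parse(p) for p in pages) if r]  # keep valid rooms
    visualize(rows)                                     # e.g. CDF plot
    return rows
```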
To make sure that crawling with curl is not blind, we need to get valid room IDs from somewhere. A good tool might be Selenium.
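Whatever Selenium collects, the ID extraction itself can be a pure function over the gathered links. The assumption that Vrbo room URLs carry a long numeric ID segment is an illustrative guess at the URL scheme; Selenium would supply `hrefs` by reading the href attribute of each result card on a search page:

```python
import re

# Assumed URL shape: a path segment of 6+ digits identifies a room.
ROOM_ID_RE = re.compile(r"/(\d{6,})(?:[/?]|$)")

def room_ids_from_links(hrefs):
    """Extract numeric room IDs from search-result links."""
    ids = []
    for href in hrefs:
        m = ROOM_ID_RE.search(href)
        if m:
            ids.append(m.group(1))
    return ids
```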
Hi Alex, your work is excellent. I am a doctoral researcher at the University of Helsinki (https://researchportal.helsinki.fi/en/persons/thanh-tung-vuong). My research requires an implementation of a simple search system. For instance, the system would crawl and index content from some health and medical sites for a conversational agent. For this purpose, would you be so kind as to share the code?
Add a service to crawl https://sslproxies.org/ every n minutes and return a list of proxies.
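A minimal scheduling sketch for such a service, using a re-arming daemon timer; `fetch` is any zero-argument callable returning a proxy list (for example one built on proxies_crawler.py), and the function name is an assumption:

```python
import threading

def start_proxy_refresher(fetch, interval_minutes=10):
    """Call `fetch` now and then every n minutes, caching the result.

    Callers read the most recent proxy list from the returned dict.
    """
    cache = {"proxies": []}

    def refresh():
        cache["proxies"] = fetch()
        timer = threading.Timer(interval_minutes * 60, refresh)
        timer.daemon = True   # do not keep the process alive
        timer.start()

    refresh()
    return cache
```

In a long-running crawl, the crawler would consult `cache["proxies"]` before each request instead of a fixed list.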
Design a curl command that crawls and stores data for 1 room from Vrbo.com.
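A sketch of driving that curl command from Python, with the command assembly kept pure so it is easy to inspect. The URL pattern is a guess at Vrbo's room-page scheme, and the real command would also carry the proxy and cookie options discussed in the other issues:

```python
import subprocess

def build_curl_cmd(room_id, out_dir="crawled_routes"):
    """Assemble the curl invocation for one room page (pure)."""
    url = f"https://www.vrbo.com/{room_id}"   # assumed URL scheme
    return ["curl", "-sL", "-o", f"{out_dir}/{room_id}.html", url]

def crawl_room(room_id):
    """Fetch one room page to disk; returns curl's exit code."""
    return subprocess.run(build_curl_cmd(room_id)).returncode
```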
We need to add cookies in order to minimize the risk of being blocked by the website that we crawl.
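With curl this comes down to a cookie jar: `-c` writes cookies received by one request and `-b` replays them on the next, so successive crawls look like one browsing session. A small helper sketch (the function name is an assumption):

```python
def curl_with_cookies(url, cookie_jar="cookies.txt"):
    """Build a curl invocation that stores (-c) and replays (-b) cookies."""
    return ["curl", "-sL", "-b", cookie_jar, "-c", cookie_jar, url]
```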
Instead of passing a single city as an argument to selenium_crawler.py, modify it to take in a txt file with the list of locations from the command line.
After #25, selenium_crawler.py can now take in a txt file (e.g. locations_to_crawl.txt) with the list of locations to crawl. To reduce the chance of being blocked by the website, we must not only sleep but also use proxies (like in run.py). One idea is to interact with the Hotspot Shield CLI using the subprocess module, switching to a different proxy after crawling the data for one location (e.g. 'barcelona') from locations_to_crawl.txt.
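The rotation pattern can be sketched with the VPN and crawl steps passed in as callables; `connect`/`disconnect` stand in for subprocess wrappers around the Hotspot Shield CLI and `crawl_one` for the per-location Selenium crawl:

```python
def read_locations(text):
    """Parse locations_to_crawl.txt content, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def crawl_locations(text, crawl_one, connect, disconnect):
    """Crawl each location behind a fresh VPN exit."""
    for loc in read_locations(text):
        connect()            # new VPN exit before each location
        try:
            crawl_one(loc)
        finally:
            disconnect()     # always drop the tunnel afterwards
```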
requirements.txt
.sh script that will create empty directories, create a virtualenv and .env, install dependencies, etc.
Since Crawlera's free trial provides only 10k requests and there are millions of pages to crawl from Vrbo.com, we need some other proxy service. https://www.hotspotshield.com/vpn/vpn-for-linux may be exactly what we need.
Parse the data from #1 and retrieve the price for that room.
Currently, parsing begins only after the crawling step is done. However, if we crawl hundreds of thousands (or millions) of pages and store them to disk, we may run out of disk space before we even complete the crawling procedure. Thus, I need to modify crawler.py so that it crawls and then immediately parses each page, and then writes to disk only information about successful pages (e.g. those that are valid rooms), rather than every raw HTML object, as it does now.
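The crawl-then-parse-immediately loop can be sketched as below; `fetch(room_id)` returns raw HTML and `parse(html)` returns a (city, price) tuple or None, both standing in for the real crawler.py and parser code. Raw HTML is never written to disk, so disk usage stays proportional to the number of valid rooms, not pages crawled:

```python
def crawl_and_parse(room_ids, fetch, parse, out="parsed_data.txt"):
    """Parse each page right after fetching it; keep only valid rows."""
    kept = 0
    with open(out, "w") as f:
        for rid in room_ids:
            row = parse(fetch(rid))
            if row is not None:          # a valid room page
                f.write(f"{row[0]},{row[1]}\n")
                kept += 1
    return kept
```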