
web-crawler-with-proxies's People

Contributors

satrancci

web-crawler-with-proxies's Issues

Refactor and fix bugs in crawler_hotspot.py

crawler_hotspot.py needs to be improved. Current problems:

  • hotspot_connect_random() returns successfully even when the connection is still in a disconnected state, so it needs more thorough error checking. It should also be split into separate functions for disconnect and connect.
  • refactor crawl_with_hotspot_shield()
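
A minimal sketch of how the connect/disconnect split with state checking might look. Everything here is illustrative: the `hotspotshield` binary name, its `connect`, `disconnect` and `status` subcommands, and the assumption that the status output contains the word "connected" are guesses about the CLI, and the function names are hypothetical rather than taken from crawler_hotspot.py.

```python
import subprocess


class HotspotError(RuntimeError):
    """Raised when the VPN is not in the expected state."""


def _run_cli(args):
    # Assumption: the CLI binary is called "hotspotshield".
    result = subprocess.run(
        ["hotspotshield", *args], capture_output=True, text=True, check=False
    )
    return result.returncode, result.stdout


def is_connected(run=_run_cli):
    # Assumption: "status" prints text containing "connected" or "disconnected".
    code, output = run(["status"])
    text = output.lower()
    return code == 0 and "connected" in text and "disconnected" not in text


def hotspot_disconnect(run=_run_cli):
    run(["disconnect"])
    # Verify the resulting state instead of trusting the command blindly.
    if is_connected(run):
        raise HotspotError("still connected after disconnect")


def hotspot_connect(location, run=_run_cli):
    code, _ = run(["connect", location])
    # A zero exit code alone is not enough; check the state as well.
    if code != 0 or not is_connected(run):
        raise HotspotError(f"could not connect to {location!r}")
```

Passing the command runner as a parameter keeps the state checks testable without a VPN client installed; the real subcommands and status format would need to be confirmed against the CLI's own help output.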

Add run_plotting.py

Requirements:

run_plotting.py must iterate over parsed_data.txt, which has one CITY, PRICE pair per row, and plot the CDF of prices:

  i) for each city separately;
  ii) for multiple cities in one plot, where the user passes a txt file on the command line listing the cities to appear on the same plot.
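
The requirements above could be sketched roughly as follows. This assumes each row is exactly `CITY, PRICE` with a plain numeric price; the function names are hypothetical, and matplotlib is an assumed choice of plotting library, imported lazily so the parsing helpers work without it.

```python
from collections import defaultdict


def read_prices(path):
    """Group prices by city from a file with one 'CITY, PRICE' row per line."""
    prices = defaultdict(list)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # Split on the last comma so city names containing commas survive.
            city, _, price = line.rpartition(",")
            prices[city.strip()].append(float(price))
    return dict(prices)


def empirical_cdf(values):
    """Return the sorted values and the fraction of samples <= each value."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]


def plot_cdfs(prices, cities, out_path):
    """Draw one CDF curve per city on a single figure and save it."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for city in cities:
        xs, ys = empirical_cdf(prices[city])
        ax.step(xs, ys, where="post", label=city)
    ax.set_xlabel("Price")
    ax.set_ylabel("CDF")
    ax.legend()
    fig.savefig(out_path)
    plt.close(fig)
```

Calling `plot_cdfs` with a single-element list covers the per-city case, while passing the cities read from the user's txt file covers the combined plot.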

Create a pipeline for 10 hard-coded room IDs

Now that we have basic functionality, it is time to start putting everything (crawling, parsing and visualizing) together. This issue aims to build a basic pipeline that automatically crawls, parses and visualizes 10 Vrbo rooms.

share code

Hi Alex, your work is excellent. I am a doctoral researcher at the University of Helsinki (https://researchportal.helsinki.fi/en/persons/thanh-tung-vuong). My research requires an implementation of a simple search system. For instance, the system would crawl and index content from some health and medical sites for a conversational agent. For this purpose, would you be so kind as to share the code?

integrate hotspot shield cli with selenium_crawler.py

After #25, selenium_crawler.py can take a txt file (e.g. locations_to_crawl.txt) with the list of locations to crawl. To reduce the chance of being blocked by the website, we must not only sleep but also use proxies (as in run.py). One idea is to drive the Hotspot Shield CLI through the subprocess module, switching to a different proxy after crawling the data for each location (e.g. 'barcelona') from locations_to_crawl.txt.
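
A rough sketch of the rotation idea, under stated assumptions: `crawl` and `switch_proxy` are hypothetical callables standing in for the Selenium crawl of one location and for whatever subprocess call changes the Hotspot Shield location, and the pause range is an arbitrary placeholder.

```python
import random
import time


def read_locations(path):
    """Read one location per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def crawl_locations(locations, proxies, crawl, switch_proxy,
                    pause=(2.0, 5.0), sleep=time.sleep):
    """Crawl each location behind a freshly chosen proxy, sleeping in between."""
    used = []
    previous = None
    for location in locations:
        # Prefer a proxy other than the previous one whenever a choice exists.
        candidates = [p for p in proxies if p != previous] or list(proxies)
        proxy = random.choice(candidates)
        switch_proxy(proxy)
        crawl(location)
        used.append((location, proxy))
        previous = proxy
        sleep(random.uniform(*pause))
    return used
```

Keeping the proxy switch behind a callable means the loop does not depend on any particular CLI syntax; the returned list of (location, proxy) pairs can be logged to see which exit point served which location.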

Prepare repo to go public

  • Add requirements.txt
  • Add a .sh script that creates empty directories, sets up a virtualenv and .env, installs dependencies, etc.
  • Update README (installation steps, explanation of the workflow, limitations, license, anything else?)

Parse data more intelligently

Currently, parsing begins only after the crawling step is done. However, if we crawl hundreds of thousands (or millions) of pages and store them on disk, we may run out of disk space before the crawl even completes. Thus, I need to modify crawler.py so that it crawls a page, parses it immediately, and writes to disk only the information about successful pages (e.g. those that are valid rooms), rather than every HTML object, as it does now.
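
The crawl-then-parse loop described above might look something like this sketch. The `fetch` and `parse` callables are hypothetical stand-ins for the existing crawling and parsing code, and the JSON-lines output format is an assumption made for illustration.

```python
import json


def crawl_and_parse(room_ids, fetch, parse, out_path):
    """Fetch each page, parse it at once, and persist only valid rooms."""
    saved = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for room_id in room_ids:
            html = fetch(room_id)
            # parse() is expected to return a dict, or None for an invalid page.
            record = parse(html)
            if record is None:
                continue  # the raw HTML is dropped and never written to disk
            out.write(json.dumps({"room_id": room_id, **record}) + "\n")
            saved += 1
    return saved
```

Because only one page is held in memory at a time and only small parsed records reach the disk, storage grows with the number of valid rooms rather than with the number of pages crawled.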
