License: MIT License


Live Website Scraper

An exemplary implementation of web scraping for websites that display fast-changing data which cannot be accessed via, e.g., a direct WebSocket connection to their API, but whose retrieval requires actively monitoring the website. This repository proposes a solution to this problem using Python and Selenium.

Note

The code and the concepts used in this repository originally stem from another, bigger project and may contain some artifacts from it. I tried to prune it down to the core principles and utilities, which should be generalizable to arbitrary websites.

Be aware that some concepts and ideas are probably not fully implemented and may still contain bugs. Please feel free to suggest improvements via pull requests or issues.

For now, this repository does not serve as a library or package to be imported, but as an example that has to be adjusted for new use cases.

Concept

In contrast to the standard data scraping approaches described on the web, which usually rely on regular polling intervals to retrieve mostly static data from a website, this concept focuses on dynamic websites with live updates. The idea is that changes to the website are actively monitored and new data is pushed directly to the application via an open connection instead of being pulled by it.

There are four steps involved:

  1. Leverage Selenium (or other software to control a browser) to open the live-updated webpage to be monitored.
  2. Run a WebSocket server which collects and further processes the data.
  3. Inject JavaScript into the webpage via, e.g., Selenium (step 1) which does the following:
    • Create a WebSocket object and open a connection to the WebSocket server (from step 2)
    • Create MutationObserver object(s) to monitor the data to be retrieved, with a callback that sends the new data over the WebSocket
  4. Process the data in the application, e.g., by aggregating specific values and storing them in a database.

In general, the concept is quite simple, and its implementation here is more complex than necessary, which is due to the project's origin as described above.

A minimal example of this concept can be found here: minimal_example.py.

The steps to run the minimal example are the following:

pip install webdriver-manager selenium websockets
python minimal_example.py
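
For orientation, a rough sketch of what such a minimal example can look like is shown below. This is only an illustration under stated assumptions (Chrome as the browser, websockets >= 12 for its threading-based server API), not the contents of minimal_example.py itself; the observed node and the port are placeholders.

# Sketch of the four steps above: WebSocket server, Selenium, injected JS, data processing.
import threading

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from websockets.sync.server import serve

WS_PORT = 8765  # placeholder port

# Step 2: a WebSocket server that collects and further processes the pushed data.
def handle_connection(websocket):
    for message in websocket:
        print("received update:", message)  # step 4: aggregate, store in a database, ...

def run_server():
    with serve(handle_connection, "localhost", WS_PORT) as server:
        server.serve_forever()

threading.Thread(target=run_server, daemon=True).start()

# Step 1: open the live-updated page to monitor with Selenium.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://rbnbr.github.io/random-signal")  # the repository's demo page

# Step 3: inject JavaScript that opens a WebSocket connection to the server (step 2)
# and registers a MutationObserver whose callback sends new data over that connection.
driver.execute_script("""
    const ws = new WebSocket("ws://localhost:%d");
    const target = document.body;  // placeholder: the node that holds the live data
    new MutationObserver(() => {
        if (ws.readyState === WebSocket.OPEN) {
            ws.send(target.textContent);
        }
    }).observe(target, { childList: true, subtree: true, characterData: true });
""" % WS_PORT)

input("Monitoring... press Enter to stop.\n")
driver.quit()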

You may use and adjust the minimal example as required. Alternatively, you may also try to run the provided project, which already implements additional features for more robustness and for the management of many tabs.

Additional Features Of The Provided Project

The project contains additional utilities that I use in the original project:

  • Classes which try to abstract from the Selenium driver object to:
    • Handle a single page
    • Handle a single remote Selenium instance (which itself handles multiple pages)
    • Handle multiple drivers (remote Selenium instances)
  • A run loop (sketched below) to:
    • Update tabs (in case some tabs were closed unexpectedly, or the webpages we want to monitor depend on other dynamic factors, conditions, ...)
    • Check the health status of pages (and issue a page refresh if a page is unhealthy)
    • Simulate activity
    • Bundle collected data and process it collectively by, e.g., pushing it to a database
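
A hypothetical sketch of such a run loop is shown below. Apart from update_tabs and push_results, which are mentioned in this README, all names are illustrative and do not necessarily match the repository's actual API.

import time

def run_loop(manager, interval_s: float = 5.0):
    """Illustrative only: 'manager' stands in for the project's monitor manager."""
    while True:
        manager.update_tabs()          # reopen closed tabs / react to changed conditions
        for page in manager.pages():   # hypothetical accessor for all managed pages
            if not page.is_healthy():  # e.g., the injected script stopped delivering data
                page.refresh()
            page.simulate_activity()   # keep the page from idling or being throttled
        manager.push_results()         # bundle collected data and, e.g., push it to a database
        time.sleep(interval_s)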

Running The Project

The current project collects artificial data from my github.io page: https://rbnbr.github.io/random-signal. You can run the project via Docker Compose: docker compose up (optionally with -d).

The docker compose file also expects two secrets; since they are not used at the moment, it is enough to create two empty files in the appropriate locations:

mkdir secure
mkdir secure/tls-ssl
echo "" > secure/postgres_credentials.txt
echo "" > secure/tls-ssl/cert.key

If you want to run it locally, make sure to set the environment variables accordingly; see the settings in docker-compose.yml and the Dockerfile for a description of them.

Some settings are also configured via the configuration files default_config.json and default_debug_config.json.

Adjusting The Project

To adjust the project to monitor a different website, you will most likely have to take the following steps (a hypothetical sketch follows the list):

  • Create a new page monitor that handles opening the website (page), refresh actions, health checks, activity simulation, and the first processing of the incoming data. You may use the existing RandomSignalPageMonitor as a reference.
  • Create a new injection script corresponding to the new website's layout and the data format the page monitor expects. You may use the existing RandomSignalInjectionScript as a reference.
  • If necessary, create a new data element for structured access to the data sent; see SignalElement as an example.
  • Adjust the monitor manager (or inherit from it) to open the correct tabs by updating (or overriding) the update_tabs method.
    • Also configure the push_results method accordingly to handle the new type of data from the new page monitor.
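
A hypothetical outline of these steps is given below; everything except the identifiers taken from this README (RandomSignalPageMonitor, RandomSignalInjectionScript, SignalElement, update_tabs, push_results) is illustrative and may differ from the actual classes in the repository.

from dataclasses import dataclass

@dataclass
class PriceElement:                 # structured access to the sent data, cf. SignalElement
    timestamp: float
    value: float

class PricePageMonitor:             # cf. RandomSignalPageMonitor
    URL = "https://example.com/live-prices"  # placeholder for the new website to monitor

    def injection_script(self) -> str:
        # cf. RandomSignalInjectionScript: JavaScript matching the new page layout and
        # sending data in the format this monitor expects
        return "/* WebSocket + MutationObserver setup tailored to the new page */"

    def process_message(self, raw: str) -> PriceElement:
        # first processing of the incoming data (assumed format: "timestamp,value")
        timestamp, value = raw.split(",")
        return PriceElement(float(timestamp), float(value))

The monitor manager's update_tabs method would then open tabs for PricePageMonitor.URL, and push_results would be adjusted to handle batches of PriceElement objects.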
