
google-reviews's Introduction

Google Reviews scraping

Scraping scripts for various service-related data sources, backed by a Dockerized Selenium infrastructure.

Setup

From the root folder of the application, run $ docker-compose up -d. This sets up a Selenium Grid with, currently, one node each for Firefox and Chrome.

Scraping

So that Selenium is used automatically, Scrapy calls are wrapped by the entry point crawl, used as follows:

$ docker-compose run google-reviews crawl --company_name="aspria berlin ku'damm"

or

$ docker-compose run google-reviews crawl --company_name="Granvalora Limburg"

Note: A current limitation of this script is that it only scrolls through the first page of reviews. The code still needs to be improved to scroll further.

One feature that is already implemented: if there are multiple companies with the same name, the script automatically takes the first company and gets its reviews.
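The "take the first company" rule can be illustrated with a minimal sketch. It is not the project's actual code: the function name first_matching_company, the shape of the result dictionaries, and the "name" key are all hypothetical, chosen only to show the selection logic.

```python
from typing import List, Optional


def first_matching_company(results: List[dict], company_name: str) -> Optional[dict]:
    """Return the first result whose name matches company_name, ignoring case.

    `results` is a hypothetical list of search hits, each a dict with a
    "name" key; the real scraper may represent them differently.
    """
    wanted = company_name.casefold().strip()
    for result in results:
        if str(result.get("name", "")).casefold().strip() == wanted:
            # Several companies may share the name; the first one wins.
            return result
    return None
```

With two hits sharing the same name, only the first is returned; when nothing matches, the function returns None rather than raising.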

You can view the scraped reviews in ArangoDB, which is available at

http://localhost:8529/

The username and password can be found in the env file inside the compose directory. The database is named google-reviews.


Testing the Google Reviews ETL

For testing purposes, I loaded the Selenium driver, opened the reviews of "aspria berlin ku'damm", and manually scrolled down to trigger multiple review requests. I then captured the content of those requests and stored it in files, which you can find in the file directory, inside the input folder.

The ETL task consists of three steps. Extract: read the raw review content. Transform: convert the raw review content into review data classes. Load: write the transformed data as CSV to the files output folder.

To run the ETL pipeline in a notebook, use the following commands:
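The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the repository's implementation: it assumes the raw content is a JSON array, and the Review fields (author, rating, text) and the function names are hypothetical.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class Review:
    # Hypothetical fields; the project's real data classes may differ.
    author: str
    rating: int
    text: str


def extract(raw: str) -> List[dict]:
    """Extract: parse raw review content (assumed here to be a JSON array)."""
    return json.loads(raw)


def transform(records: List[dict]) -> List[Review]:
    """Transform: convert raw records into Review data classes."""
    return [
        Review(
            author=str(r.get("author", "")).strip(),
            rating=int(r.get("rating", 0)),
            text=str(r.get("text", "")).strip(),
        )
        for r in records
    ]


def load(reviews: List[Review], out) -> None:
    """Load: write the reviews as CSV to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(Review)])
    writer.writeheader()
    for review in reviews:
        writer.writerow(asdict(review))


def run_etl(raw: str) -> str:
    """Run all three steps and return the CSV text."""
    buffer = io.StringIO(newline="")
    load(transform(extract(raw)), buffer)
    return buffer.getvalue()
```

To write to disk instead of a string, pass a file opened with open(path, "w", newline="") to load; the path would point at the output folder mentioned above.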

python -m virtualenv .venv

source .venv/bin/activate

pip install notebook

cd notebooks

jupyter notebook

This starts Jupyter Notebook, where you can test the ETL pipeline.
