
Google Trends Monitor

End-to-end data pipeline to process data from Google Trends.

Technologies

  • Google Cloud Platform (GCP): Google Cloud Storage and BigQuery
  • Docker
  • Airflow
  • dbt
  • Python: pandas, pytrends
  • Google Data Studio

This work is the final project of the Data Engineering Zoomcamp. Ankush, Sejal, Victoria and Alexey: Thank you very much! 😀

Problem Description

We are interested in exploring the relevance of, and the queries related to, the word 'audiobook' in Google searches, as shown on the Google Trends Explore page. The official page already covers several of our needs, but the (hypothetical) team also requires:

  • Storage of historical data.
  • Ease of access to results from specific locations (Argentina, Germany, Mexico, Spain and the United States).
  • Linked to the above: the word 'audiobook' needs to be adapted to each language ('audiolibro' in Spanish, 'Hörbuch' in German, etc.).
  • An internal dashboard with automatic data ingestion.

To meet these requirements we implement the following data pipeline.

Data Pipeline

The current implementation obtains results for the word "audiobook" only. However, you only need to edit the keyword dictionary in the following file to get results for any other word (see the sketch below).
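
As an illustration only, the per-location keyword mapping might look like the sketch below; the dictionary name, keys and layout are assumptions, not necessarily what the repository uses.

    # Hypothetical keyword dictionary: maps a Google Trends geo code to the
    # locally translated search term. Edit the values to monitor another word.
    KEYWORDS = {
        "AR": "audiolibro",  # Argentina (Spanish)
        "DE": "Hörbuch",     # Germany (German)
        "ES": "audiolibro",  # Spain (Spanish)
        "MX": "audiolibro",  # Mexico (Spanish)
        "US": "audiobook",   # United States (English)
    }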

The pipeline runs on Airflow with Docker Compose and uses the pytrends library, an unofficial API for Google Trends.
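
For reference, a minimal pytrends call to fetch the three datasets used by the pipeline looks roughly like this; the timeframe and other parameters are assumptions, and the actual DAG wraps these calls in tasks.

    from pytrends.request import TrendReq

    # Connect to Google Trends through the unofficial pytrends API.
    pytrends = TrendReq(hl="en-US", tz=360)

    # Build a payload for one keyword and one location ("today 5-y" is an
    # assumed timeframe), then pull the three datasets used downstream.
    pytrends.build_payload(["audiobook"], timeframe="today 5-y", geo="US")

    interest = pytrends.interest_by_region()  # DataFrame of interest by region
    queries = pytrends.related_queries()      # dict keyed by keyword, with 'top'/'rising' DataFrames
    topics = pytrends.related_topics()        # dict keyed by keyword, with 'top'/'rising' DataFrames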

Airflow

In Airflow we have two DAGs:

  • The first one, data_ingestion_gcs_trends.py:
    • Downloads the interest by region for "audiobook" for each of the five locations.
    • Downloads the related queries.
    • Downloads the related topics.
    • Adds some context information and converts the results to .parquet format (the files are not large, so this probably brings no substantial improvement, but we do it for pedagogical purposes).
    • Finally, uploads the files to Google Cloud Storage.
  • The second one, gcs_to_bq_dag.py:
    • Moves files and reorganizes the folder structure.
    • Creates external tables in BigQuery (see the sketch after this list).
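
As a rough sketch of the upload and external-table steps, the tasks can be written with the official Google Cloud client libraries along these lines; bucket, path and table names are placeholders rather than the ones used in the repository.

    from google.cloud import bigquery, storage

    # data_ingestion_gcs_trends.py (sketch): upload a local parquet file to GCS.
    def upload_to_gcs(bucket_name: str, local_path: str, blob_path: str) -> None:
        client = storage.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS
        bucket = client.bucket(bucket_name)
        bucket.blob(blob_path).upload_from_filename(local_path)

    # gcs_to_bq_dag.py (sketch): expose the parquet files as a BigQuery external table.
    def create_external_table(table_id: str, source_uri: str) -> None:
        client = bigquery.Client()
        external_config = bigquery.ExternalConfig("PARQUET")
        external_config.source_uris = [source_uri]  # e.g. "gs://<bucket>/trends/*.parquet"
        table = bigquery.Table(table_id)            # "<project>.<dataset>.<table>"
        table.external_data_configuration = external_config
        client.create_table(table, exists_ok=True)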

dbt

We then transform the loaded data by developing a dbt project.

In dbt we set up two groups of models:

  • Staging:
    • Separates queries, topics and trends.
    • For each table, casts the string date to the correct date type and adds a column with the rank value.
  • Core:
    • Transforms the geo column so that all tables can be joined (this could have been done at the beginning, but we chose to work with a hypothetical-needs scenario).
    • Splits the tables into top and rising (see the Google Trends page to visualize this distinction).
    • Keeps only up to 20 rows for trends, 10 for queries/topics and 5 for rising (see the illustrative sketch after this list).
    • Creates a specific table with data from our selected countries.
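
The actual transformations are dbt SQL models; purely as an illustration of the rank-and-limit step, a pandas equivalent would look roughly like this (the 'geo' and 'rank' column names are assumptions).

    import pandas as pd

    # Illustrative only: the real logic lives in the dbt core models.
    # Keep the top n rows per location, ordered by the rank column.
    def keep_top_n(df: pd.DataFrame, n: int) -> pd.DataFrame:
        return (
            df.sort_values(["geo", "rank"])
              .groupby("geo", group_keys=False)
              .head(n)
        )

    # keep_top_n(trends_df, 20), keep_top_n(queries_df, 10), keep_top_n(rising_df, 5)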

All these transformations are intended to facilitate the future visualization of the results.

Google Data Studio

Finally, we connect BigQuery (the core dataset) to Google Data Studio and build this dashboard.

Some additional notes:

  • We added comments to all files for more details.
  • The DAGs and the dbt job are scheduled daily (see the skeleton below).
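
The daily scheduling follows the standard Airflow pattern; a minimal skeleton is sketched below, with the start date and task name as placeholders rather than the values used in the repository.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def ingest_trends() -> None:
        # Placeholder for the pytrends download / parquet / GCS upload logic.
        pass

    with DAG(
        dag_id="data_ingestion_gcs_trends",
        start_date=datetime(2022, 1, 1),  # assumed start date
        schedule_interval="@daily",       # run once per day
        catchup=False,
    ) as dag:
        PythonOperator(task_id="ingest_trends", python_callable=ingest_trends)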

Instructions to Run

  1. Clone the repository.
  2. Modify the values in docker-compose.yaml (everything related to the GCP credentials).
  3. Complete the .env_example file with the correct information (you need GCP service account credentials).
  4. Run:

sudo docker compose up

  5. Go to localhost:8080 in a browser (user: airflow, password: airflow) and run all DAGs.

  6. Then, for analytics engineering, you need to set up dbt Cloud with BigQuery. Please follow these steps.

Check the values in these files:

  • schemas.yml in the staging model
  • dbt_project.yml

In the dbt console run:

dbt deps
dbt run

If all goes well, you will see the "staging" and "core" datasets in BigQuery.

  7. Finally, connect BigQuery and Data Studio with just a few clicks.
