End-to-end data pipeline to process data from Google Trends.
- Google Cloud Platform (GCP): Google Cloud Storage and BigQuery
- Docker
- Airflow
- dbt
- Python: pandas, pytrends
- Google Data Studio
This work is the final project of the Data Engineering Zoomcamp. Ankush, Sejal, Victoria and Alexey: thank you very much! 😀
We are interested in exploring the relevance of, and the queries related to, the word 'audiobook' in Google Trends Explore searches. The official page already covers several of our needs, but the (hypothetical) team also requires:
- Storage of historical data.
- Ease of access to results from specific locations (Argentina, Germany, Mexico, Spain and United States).
- Linked to the above: the word 'audiobook' needs to be adapted to each language: 'audiolibro' in Spanish, 'Hörbuch' in German, etc.
- Internal Dashboard with automatic information ingestion.
To solve all this we are going to implement the following data pipeline.
The current development obtains results for the word "audiobook" only. However, to get results for any other word you only need to edit the dictionary in the following file.
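A hypothetical shape for that dictionary is sketched below: the search term translated per target location, keyed by two-letter Google Trends geo codes (the variable name and exact structure are illustrative, not the project's actual identifiers).

```python
# Hypothetical keyword dictionary: swap the values to track another word.
KEYWORDS = {
    "AR": "audiolibro",  # Argentina (Spanish)
    "DE": "Hörbuch",     # Germany (German)
    "MX": "audiolibro",  # Mexico (Spanish)
    "ES": "audiolibro",  # Spain (Spanish)
    "US": "audiobook",   # United States (English)
}
```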
The pipeline runs on Airflow with Docker Compose and the pytrends library, an unofficial API for Google Trends.
In Airflow we have two DAGs:
- `data_ingestion_gcs_trends.py`:
  - Downloads the interest by region for "audiobook" for each of the five locations.
  - Downloads the related queries.
  - Downloads the related topics.
  - Adds some context information and converts the results to `.parquet` format (the files are not large, so this probably brings no substantial improvement, but we do it for pedagogical purposes).
  - Finally, uploads everything to Google Cloud Storage.
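The "add context and convert to `.parquet`" step could look like the sketch below, assuming the raw interest-by-region results arrive as a pandas DataFrame (function and column names here are illustrative, not the project's actual identifiers; writing parquet requires pyarrow or fastparquet).

```python
import datetime

import pandas as pd


def add_context(df: pd.DataFrame, keyword: str, geo: str) -> pd.DataFrame:
    """Attach the search term, location and download date to each row."""
    out = df.copy()
    out["keyword"] = keyword
    out["geo"] = geo
    out["downloaded_at"] = datetime.date.today().isoformat()
    return out


def save_parquet(df: pd.DataFrame, path: str) -> None:
    """Write the enriched table as a .parquet file (needs pyarrow installed)."""
    df.to_parquet(path, index=False)


# Example: enrich a tiny interest-by-region table for Argentina.
raw = pd.DataFrame({"region": ["Buenos Aires"], "audiobook": [100]})
enriched = add_context(raw, keyword="audiolibro", geo="AR")
```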
- `gcs_to_bq_dag.py`:
  - Moves files and refines the folder layout.
  - Creates external tables in BigQuery.
We then transform the loaded data by developing a dbt project.
In dbt we set up two models:
- Staging:
- Separate queries, topics and trends.
- For each table, casts the string date to the correct format and includes a column with the rank value.
- Core:
- Transforms the geo column so all tables can be joined (this could be done at the beginning, but we chose to work with a hypothetical-needs case).
- Splits the tables into tops and rising (see the Google Trends page to visualize this).
- Keeps only up to 20 rows in trends, 10 in queries/topics and 5 in rising.
- Create a specific table with data from our selected countries.
All these transformations are intended to facilitate the future visualization of the results.
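As a rough illustration of the top/rising trimming described above, here is a plain-Python sketch (the actual project does this in dbt SQL; the helper name and row shape are hypothetical):

```python
# Split a related-queries table into "top" and "rising" and keep only the
# first N of each, ordered by ascending rank (names are illustrative).
def keep_top(rows, n):
    """Keep at most n rows, ordered by ascending rank."""
    return sorted(rows, key=lambda r: r["rank"])[:n]

queries = [
    {"query": "free audiobook", "type": "top", "rank": 2},
    {"query": "audiobook app", "type": "top", "rank": 1},
    {"query": "ai audiobook", "type": "rising", "rank": 1},
]

tops = keep_top([r for r in queries if r["type"] == "top"], n=10)
rising = keep_top([r for r in queries if r["type"] == "rising"], n=5)
```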
Finally, we connect BigQuery (the core dataset) with Google Data Studio and build this dashboard.
Some additional notes:
- We put comments in all files for more details.
- DAGs and dbt are scheduled daily.
- Clone the repository.
- Modify the values in `docker-compose.yaml` (everything related to the GCP credentials).
- Complete the `.env_example` file with the correct information (you need the GCP service-account credentials).
- Run `sudo docker compose up`.
- Go to localhost:8080 in a browser (user: airflow, pass: airflow) and run all DAGs.
- Then, for analytics engineering, you need to set up dbt Cloud with BigQuery. Please follow these steps.
Check the values in these files:
- `schemas.yml` in the staging model
- `dbt_project.yml`

In the dbt console run:

```
dbt deps
dbt run
```
If all goes well, you will see the "staging" and "core" datasets in BigQuery.
- Finally, connect BigQuery and Data Studio with just a few clicks.