This repository contains an end-to-end demo of how to use Google BigQuery with a public dataset.
Be careful: by clicking on this button, you will run processes on your GCP project. This is directly linked to your billing account. Running all the following queries represents a cost of about $10.
This repository shows multiple examples of how to use Google BigQuery to extract information from data.
Follow the steps in this guide to execute your own BigQuery requests!
Time to complete: About 15 minutes
Before we jump in, let's briefly go over what Google BigQuery can do.
Google Cloud documentation defines BigQuery like this:
"BigQuery is a fully managed enterprise data warehouse that helps you manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence. BigQuery's serverless architecture lets you use SQL queries to answer your organization's biggest questions with zero infrastructure management. BigQuery's scalable, distributed analysis engine lets you query terabytes in seconds and petabytes in minutes."
You can use the Google Cloud web console to interact with BigQuery.
To use it, visit console.cloud.google.com/bigquery. The page that opens should look like this:
The essential parts of this page are:
- The Editor, where we will write our SQL queries
- The Query results pane, which displays information about the execution of our query
- The Explorer, to browse through projects, datasets, and tables
Google Cloud provides lots of public data directly within BigQuery. You can browse them here.
For the purpose of this demo, we will use a common one: the Wikipedia dataset, available in the bigquery-public-data project.
This dataset provides a number of tables, including the page_views tables, which contain the number of times users of the Wikipedia service have viewed a given page, per hour.
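Before creating anything, you can peek at a few rows to get a feel for the schema. This is a sketch: the column list below assumes the documented schema of the public pageviews tables, and the single-day filter on the `datehour` partitioning column keeps the scan (and cost) small.

```sql
-- Preview a few rows from a single partition to inspect the schema.
-- Filtering on DATE(datehour) limits the amount of data scanned.
SELECT
  datehour, wiki, title, views
FROM
  `bigquery-public-data.wikipedia.pageviews_2022`
WHERE
  DATE(datehour) = "2022-01-01"
  AND wiki = "fr"
LIMIT 10
```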
These tables are quite heavy, from 1.5 TB to 2.3 TB. To simplify our work and limit execution costs, we are going to work on a sub-dataset, limited to French users only, like this:
CREATE OR REPLACE TABLE
  wikipedia.pageviews_fr_2022
PARTITION BY
  DATE(datehour)
CLUSTER BY
  title
OPTIONS (
  description = "Wikipedia pageviews for France, partitioned by date and clustered by title",
  require_partition_filter = TRUE
) AS
SELECT
  *
FROM
  `bigquery-public-data.wikipedia.pageviews_2022`
WHERE
  DATE(datehour) >= "2022-01-01"
  AND wiki = "fr"
With this query, we create a new table named pageviews_fr_2022 in a dataset named wikipedia.
Make sure you have created the wikipedia dataset before running this query, otherwise you will get an error.
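If you have not created the dataset yet, one way to do it is with a single DDL statement in the Editor. This is a sketch: it assumes you want the US multi-region, which is where the bigquery-public-data datasets live (the target dataset must be in a compatible location to copy data from them).

```sql
-- Create the target dataset in the US multi-region,
-- matching the location of bigquery-public-data.
CREATE SCHEMA IF NOT EXISTS wikipedia
OPTIONS (location = "US")
```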
This query should process 847 GB at the time of this writing, but thanks to optimization mechanisms used by BigQuery, we will only be billed for 24.79 GB:
As you can see in the Execution details tab, the query ran in about 20 seconds, but Google allocated you nearly 7 hours of total compute (slot) time. Awesome!
You can adapt the dataset by changing the selected dates or the wiki.
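For example, a variant restricted to German pages from July 2022 onward could look like the following sketch (the table name pageviews_de_2022h2 is just an illustrative choice, not something used elsewhere in this demo):

```sql
-- Hypothetical variant: German wiki, second half of 2022 only.
CREATE OR REPLACE TABLE
  wikipedia.pageviews_de_2022h2
PARTITION BY
  DATE(datehour)
CLUSTER BY
  title
OPTIONS (require_partition_filter = TRUE) AS
SELECT
  *
FROM
  `bigquery-public-data.wikipedia.pageviews_2022`
WHERE
  DATE(datehour) >= "2022-07-01"
  AND wiki = "de"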
A simple first query can return the top 5 most-viewed pages in France. In this query, we'll omit special pages like the homepage.
SELECT
title,
SUM(views) AS total_views
FROM
`wikipedia.pageviews_fr_2022`
WHERE
DATE(datehour) >= "2022-01-01"
AND NOT STARTS_WITH(title, 'Wikipédia:')
AND NOT STARTS_WITH(title, 'Spécial:')
AND wiki = "fr"
GROUP BY
1
ORDER BY
2 DESC
LIMIT
5
The query returns the top 5 in about 3 seconds:
| title | total_views |
|---|---|
| Liste_de_sondages_sur_l'élection_présidentielle_française_de_2022 | 3164436 |
| Michael_Jackson | 1587252 |
| États-Unis | 1222191 |
| Charles_Baudelaire | 1118479 |
| France | 1091169 |
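Note that because the table was created with require_partition_filter=TRUE, every query against it must filter on the partitioning column DATE(datehour). As a sketch of the constraint, a query like the one below would be rejected by BigQuery with an error rather than silently scanning the whole table:

```sql
-- Rejected: no filter on DATE(datehour), which
-- require_partition_filter=TRUE makes mandatory.
SELECT
  title,
  SUM(views) AS total_views
FROM
  `wikipedia.pageviews_fr_2022`
GROUP BY
  title
ORDER BY
  total_views DESC
LIMIT 5
```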
The previous query was cool, but we can go further. BigQuery is an analytical tool, so we want to extract insights from historical data. To do this, we are going to measure the audience impact of a sample of French TV shows.
SELECT
title,
date(datehour),
views
FROM
`wikipedia.pageviews_fr_2022`
WHERE
date(datehour) >= '2022-01-01'
AND title IN (
"The_Voice_:_La_Plus_Belle_Voix",
"Koh-Lanta",
"Danse_avec_les_stars",
"L'amour_est_dans_le_pré",
"Mariés_au_premier_regard_(France)",
"Top_Chef"
)
ORDER BY
2 ASC
As you may see in the console, the raw result is not very readable.
To make the data usable, we can use Google Data Studio, a free data-visualization solution integrated with BigQuery. To create a dashboard that presents the previous result in a human-readable way, you can do the following:
- In the Query results tab, on the right, click on Explore Data > Explore with Data Studio
- A new tab/window should open with a default chart: a table
- We will transform this table into a chart better suited to this case: select the Smoothed time series chart in the right-hand section
- Update the Metric to use Views instead of Record Count
You should see something like:
You can see that Wikipedia searches are mostly seasonal and related to the airing of the show.
DO NOT EXECUTE THIS ONE ON YOUR PROJECT
The related cost is too damned high: around $10!
This query is the last one of this demo. I would like to show how BigQuery handles processing more than 5 TB of data, by querying all Wikipedia views in France since January 1, 2020.
SELECT
date(datehour) as d,
SUM(views) as v
FROM
`bigquery-public-data.wikipedia.pageviews_202*`
WHERE
date(datehour) >= '2020-01-01'
AND title LIKE "%Covid%"
AND wiki = "fr"
GROUP BY 1
ORDER BY
1 ASC
This query took only 17 seconds to return results! 17 seconds for 5 TB! According to the Execution details tab, it would have taken much longer without parallelization (about 9 hours of slot time).
You may ask yourself, "How much does this kind of query cost?", and that's a fair question. The Job information tab brings a good surprise: we were only billed for 190 GB of bytes processed!
If you want to understand why, check out this related story ;)