If you plan to use a service account with client library code, you need to set an environment variable:
export GOOGLE_APPLICATION_CREDENTIALS=~/development/googlecloud-bigquery/src/main/resources/google-analytics-9ca2e8444354.json
Now we can run the application without specifying credentials explicitly:
java -cp ~/development/googlecloud-bigquery/build/libs/googlecloud-bigquery-1.0-SNAPSHOT.jar \
-Dhttp.proxyHost=proxy -Dhttp.proxyPort=8080 \
-Dhttps.proxyHost=proxy -Dhttps.proxyPort=8080 \
bigdata.googlecloud.bigquery.SimpleApp \
google-analytics 112233445 ga_sessions_intraday_20180103
Alternatively, this option does not require exporting the environment variable, but the credentials must be created inside the code, so the JSON credentials file is passed as an input argument:
java -cp ~/development/googlecloud-bigquery/build/libs/googlecloud-bigquery-1.0-SNAPSHOT.jar \
-Dhttp.proxyHost=proxy -Dhttp.proxyPort=8080 \
-Dhttps.proxyHost=proxy -Dhttps.proxyPort=8080 \
bigdata.googlecloud.bigquery.SimpleAppWithCred \
~/development/googlecloud-bigquery/src/main/resources/google-analytics-9ca2e8444354.json google-analytics 112233445 ga_sessions_intraday_20180103
wget https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-182.0.0-linux-x86_64.tar.gz
Add the SDK's bin directory to your PATH in .bashrc, then initialize:
gcloud init
export GOOGLE_APPLICATION_CREDENTIALS=~/development/googlecloud-bigquery/src/main/resources/google-analytics-9ca2e8444354.json
ACCESS_TOKEN="$(gcloud auth application-default print-access-token)"
ProjectId | DatasetId | TableId |
---|---|---|
google-analytics | 112233445 | ga_sessions_intraday_yyyymmdd |
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId/tables/tableId
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId/tables
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId/tables/tableId/data
GET https://www.googleapis.com/bigquery/v2/projects/projectId/jobs
POST https://www.googleapis.com/upload/bigquery/v2/projects/projectId/jobs
GET https://www.googleapis.com/bigquery/v2/projects/projectId/jobs/jobId
POST https://www.googleapis.com/bigquery/v2/projects/projectId/jobs/jobId/cancel
POST https://www.googleapis.com/bigquery/v2/projects/projectId/queries
GET https://www.googleapis.com/bigquery/v2/projects/projectId/queries/jobId
GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables/ga_sessions_intraday_20171212
GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables/ga_sessions_intraday_20180103/data
POST https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries
GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries/job_gRaHSTeIJCmIZzz4bctj0zsuffxc
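The endpoints above share a common structure, so a small helper can assemble them from the path parameters. This is a convenience sketch: the `bq_url` function name and argument order are our own, not part of the API.

```shell
#!/bin/sh
# Build BigQuery v2 REST URLs from path parameters. The endpoint paths
# mirror the list above; bq_url itself is a hypothetical helper.
BASE="https://www.googleapis.com/bigquery/v2"

bq_url() {
  # usage: bq_url datasets|tables|tabledata|jobs|queries projectId [datasetId] [tableId|jobId]
  case "$1" in
    datasets)  echo "$BASE/projects/$2/datasets${3:+/$3}" ;;
    tables)    echo "$BASE/projects/$2/datasets/$3/tables${4:+/$4}" ;;
    tabledata) echo "$BASE/projects/$2/datasets/$3/tables/$4/data" ;;
    jobs)      echo "$BASE/projects/$2/jobs${3:+/$3}" ;;
    queries)   echo "$BASE/projects/$2/queries${3:+/$3}" ;;
  esac
}

# Example: the tabledata.list URL used later in this document
bq_url tabledata google-analytics 112233445 ga_sessions_intraday_20180103
```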
- Datasets: get
- Datasets: list
- Tables: get
- Tables: list
- Tabledata: list
- Jobs: query
- Jobs: getQueryResults
- Jobs: get
- Jobs: list
- Jobs: insert
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables \
-H "Authorization: Bearer $ACCESS_TOKEN"
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables/ga_sessions_intraday_20180103 \
-H "Authorization: Bearer $ACCESS_TOKEN"
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables/ga_sessions_intraday_20180103/data \
-H "Authorization: Bearer $ACCESS_TOKEN"
We must run JOBS. Jobs are asynchronous tasks such as running queries, loading data, and exporting data. The query method runs a job and returns a job ID.
Request: with a hardcoded JSON body
curl -X POST https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries \
-d '{"query":"select * from ga_sessions_intraday_20180103","kind":"bigquery#queryRequest","maxResults":2,"defaultDataset":{"projectId":"google-analytics","datasetId":"112233445"},"dryRun":false}' \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ACCESS_TOKEN"
Request: with the file request-body-query.json
curl -X POST https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries \
-d @/home/angelrojo/development/googlecloud-bigquery/src/main/resources/request-body-query.json \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ACCESS_TOKEN"
request-body-query.json looks like:
{
  "query": "SELECT visitorId,visitNumber,visitId,visitStartTime,date,hits.time,hits.hour,hits.minute,hits.referer,hits.page.pagePath FROM ga_sessions_intraday_20180104 OMIT RECORD IF COUNT(hits.hour) < 2 LIMIT 1",
  "kind": "bigquery#queryRequest",
  "maxResults": 2,
  "defaultDataset": {
    "projectId": "google-analytics",
    "datasetId": "112233445"
  },
  "dryRun": false,
  "useLegacySql": true
}
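Since the intraday table name changes every day, the request body can be generated instead of hand-edited before each call. A minimal sketch, assuming the date suffix is the only thing that varies (the `make_query_body` helper is hypothetical):

```shell
#!/bin/sh
# Generate the queryRequest body for a given ga_sessions_intraday date
# suffix. Field names match the request body shown above.
make_query_body() {
  suffix="$1"   # e.g. 20180104
  cat <<EOF
{
  "query": "select * from ga_sessions_intraday_${suffix}",
  "kind": "bigquery#queryRequest",
  "maxResults": 2,
  "defaultDataset": {
    "projectId": "google-analytics",
    "datasetId": "112233445"
  },
  "dryRun": false,
  "useLegacySql": true
}
EOF
}

# Write the body for one day's table; pass it to curl with -d @request-body-query.json
make_query_body 20180104 > request-body-query.json
```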
Response: if we don't set the timeout parameter, the query probably does not return data yet, but it does return a jobId to use with the getQueryResults method.
{
  "kind": "bigquery#queryResponse",
  "jobReference": {
    "projectId": "google-analytics",
    "jobId": "job_RvsOhR4xxWHk43eKShWcqm5a0srh"
  },
  "jobComplete": false
}
Get the status of a specific job
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/jobs/job_35axds8rFh5v6x_ztzwCNIxvIR9- \
-H "Authorization: Bearer $ACCESS_TOKEN"
One of the output fields in the response body is status.state; it can be PENDING, RUNNING, or DONE.
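Because the state moves from PENDING through RUNNING to DONE, a client typically polls jobs.get until the job is DONE. Here is a sketch of that loop, with the real HTTP call replaced by a stub: `fetch_job_state` is hypothetical, and in real use it would curl the jobs.get endpoint with the Bearer token and parse status.state out of the JSON.

```shell
#!/bin/sh
# Poll until status.state is DONE. fetch_job_state is a stub simulating
# GET .../projects/google-analytics/jobs/$JOB_ID; it pretends the job
# finishes on the third poll.
i=0
fetch_job_state() {
  i=$((i + 1))
  if [ "$i" -lt 3 ]; then STATE=RUNNING; else STATE=DONE; fi
}

STATE=PENDING
until [ "$STATE" = "DONE" ]; do
  fetch_job_state
  echo "status.state=$STATE"
  # sleep 5   # a real client would wait between polls
done
```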
The response also contains the destination table in BigQuery where the result data has been inserted:
...
"destinationTable": {
  "projectId": "google-analytics",
  "datasetId": "_9dbcf0c4fbdc437de6df08d8d7d353cd7888b31c",
  "tableId": "anonf7bb7b98737fb464a850db7bf8d5a0e3eacdc6de"
}
...
curl -X GET "https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries/job_RvsOhR4xxWHk43eKShWcqm5a0srh?timeoutMs=60000" \
-H "Authorization: Bearer $ACCESS_TOKEN"
The response looks like a standard getQueryResults JSON body.
To extract the data for a specific jobId, we must use Google Cloud Storage.
Request: with the file request-body-insert.json
curl -X POST https://www.googleapis.com/bigquery/v2/projects/google-analytics/jobs \
-d @/home/angelrojo/development/googlecloud-bigquery/src/main/resources/request-body-insert.json \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ACCESS_TOKEN"
request-body-insert.json looks like:
{
  "jobReference": {
    "projectId": "google-analytics",
    "jobId": "custom-jobId-csv-20170104_1200"
  },
  "configuration": {
    "extract": {
      "sourceTable": {
        "projectId": "google-analytics",
        "datasetId": "_9dbcf0c4fbdc437de6df08d8d7d353cd7888b31c",
        "tableId": "anonf7bb7b98737fb464a850db7bf8d5a0e3eacdc6de"
      },
      "destinationUris": ["gs://google-analytics/extraction-0001-ga_sessions_intraday_20180104.csv"],
      "destinationFormat": "CSV",
      "compression": "NONE"
    }
  }
}
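Because the extract job's jobId and destination gs:// URI must be unique per run, the insert body is best generated with a stamped name rather than copied. A sketch, assuming a `make_extract_body` helper of our own invention; the datasetId/tableId arguments are the ones read from the earlier destinationTable:

```shell
#!/bin/sh
# Generate the jobs.insert extract body with a unique jobId and CSV name.
make_extract_body() {
  stamp="$1"      # e.g. 20180104_1200, makes jobId and csv name unique
  dataset="$2"    # anonymous datasetId from destinationTable
  table="$3"      # anonymous tableId from destinationTable
  cat <<EOF
{
  "jobReference": {
    "projectId": "google-analytics",
    "jobId": "custom-jobId-csv-${stamp}"
  },
  "configuration": {
    "extract": {
      "sourceTable": {
        "projectId": "google-analytics",
        "datasetId": "${dataset}",
        "tableId": "${table}"
      },
      "destinationUris": ["gs://google-analytics/extraction-${stamp}.csv"],
      "destinationFormat": "CSV",
      "compression": "NONE"
    }
  }
}
EOF
}

# Example with the ids from the earlier response
make_extract_body 20180104_1200 _9dbcf0c4fbdc437de6df08d8d7d353cd7888b31c anonf7bb7b98737fb464a850db7bf8d5a0e3eacdc6de
```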
List all jobs in the project:
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/jobs \
-H "Authorization: Bearer $ACCESS_TOKEN"
- Export the GOOGLE_APPLICATION_CREDENTIALS variable containing the path to the credentials JSON
- Obtain the ACCESS_TOKEN
- Call query and collect the jobId -> job_35axds8rFh5v6x_ztzwCNIxvIR9-
- Call get for the job status and collect the datasetId and tableId fields of the destinationTable object from the response body
- Call insert to perform the extract, passing the custom jobId plus the previous datasetId and tableId. This writes a file in the requested format, CSV in this case, to a GCS (Google Cloud Storage) bucket. The CSV filename must be parameterized so that it is different for each call to BigQuery
- Call the Google Cloud Storage REST API