If you plan to use a service account with client library code, you need to set an environment variable:
export GOOGLE_APPLICATION_CREDENTIALS=~/development/googlecloud-bigquery/src/main/resources/google-analytics-9ca2e8444354.json
Now we can run the application without specifying credentials explicitly:
java -cp ~/development/googlecloud-bigquery/build/libs/googlecloud-bigquery-1.0-SNAPSHOT.jar \
-Dhttp.proxyHost=proxy -Dhttp.proxyPort=8080 \
-Dhttps.proxyHost=proxy -Dhttps.proxyPort=8080 \
bigdata.googlecloud.bigquery.SimpleApp \
google-analytics 112233445 ga_sessions_intraday_20180103
Alternatively, this option does not require exporting the environment variable, but the credentials must be created inside the code, so the JSON credentials file is passed as an input argument:
java -cp ~/development/googlecloud-bigquery/build/libs/googlecloud-bigquery-1.0-SNAPSHOT.jar \
-Dhttp.proxyHost=proxy -Dhttp.proxyPort=8080 \
-Dhttps.proxyHost=proxy -Dhttps.proxyPort=8080 \
bigdata.googlecloud.bigquery.SimpleAppWithCred \
~/development/googlecloud-bigquery/src/main/resources/google-analytics-9ca2e8444354.json google-analytics 112233445 ga_sessions_intraday_20180103
wget https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-182.0.0-linux-x86_64.tar.gz
Add the SDK's bin directory to your PATH in .bashrc, then initialize:
gcloud init
export GOOGLE_APPLICATION_CREDENTIALS=~/development/googlecloud-bigquery/src/main/resources/google-analytics-9ca2e8444354.json
ACCESS_TOKEN="$(gcloud auth application-default print-access-token)"
ProjectId | DatasetId | TableId |
---|---|---|
google-analytics | 112233445 | ga_sessions_intraday_yyyymmdd |
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId/tables/tableId
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId/tables
GET https://www.googleapis.com/bigquery/v2/projects/projectId/datasets/datasetId/tables/tableId/data
GET https://www.googleapis.com/bigquery/v2/projects/projectId/jobs
POST https://www.googleapis.com/upload/bigquery/v2/projects/projectId/jobs
GET https://www.googleapis.com/bigquery/v2/projects/projectId/jobs/jobId
POST https://www.googleapis.com/bigquery/v2/projects/projectId/jobs/jobId/cancel
POST https://www.googleapis.com/bigquery/v2/projects/projectId/queries
GET https://www.googleapis.com/bigquery/v2/projects/projectId/queries/jobId
GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables/ga_sessions_intraday_20171212
GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables/ga_sessions_intraday_20180103/data
POST https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries
GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries/job_gRaHSTeIJCmIZzz4bctj0zsuffxc
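The endpoints above share a common structure, so a small helper can assemble them from the path parameters. This is a convenience sketch: the `bq_url` function name and argument order are our own, not part of the API.

```shell
#!/bin/sh
# Build BigQuery v2 REST URLs from path parameters. The endpoint paths
# mirror the list above; bq_url itself is a hypothetical helper.
BASE="https://www.googleapis.com/bigquery/v2"

bq_url() {
  # usage: bq_url datasets|tables|tabledata|jobs|queries projectId [datasetId] [tableId|jobId]
  case "$1" in
    datasets)  echo "$BASE/projects/$2/datasets${3:+/$3}" ;;
    tables)    echo "$BASE/projects/$2/datasets/$3/tables${4:+/$4}" ;;
    tabledata) echo "$BASE/projects/$2/datasets/$3/tables/$4/data" ;;
    jobs)      echo "$BASE/projects/$2/jobs${3:+/$3}" ;;
    queries)   echo "$BASE/projects/$2/queries${3:+/$3}" ;;
  esac
}

# Example: the tabledata.list URL used later in this document
bq_url tabledata google-analytics 112233445 ga_sessions_intraday_20180103
```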
- Datasets: get
- Datasets: list
- Tables: get
- Tables: list
- Tabledata: list
- Jobs: query
- Jobs: getQueryResults
- Jobs: get
- Jobs: list
- Jobs: insert
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables \
-H "Authorization: Bearer $ACCESS_TOKEN"
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables/ga_sessions_intraday_20180103 \
-H "Authorization: Bearer $ACCESS_TOKEN"
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/datasets/112233445/tables/ga_sessions_intraday_20180103/data \
-H "Authorization: Bearer $ACCESS_TOKEN"
We must run JOBS. Jobs are asynchronous tasks such as running queries, loading data, and exporting data. The query method runs a job and returns a job ID.
Request: with a hardcoded JSON body
curl -X POST https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries \
-d '{"query":"select * from ga_sessions_intraday_20180103","kind":"bigquery#queryRequest","maxResults":2,"defaultDataset":{"projectId":"google-analytics","datasetId":"112233445"},"dryRun":false}' \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ACCESS_TOKEN"
Request: with the file request-body-query.json
curl -X POST https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries \
-d @/home/angelrojo/development/googlecloud-bigquery/src/main/resources/request-body-query.json \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ACCESS_TOKEN"
request-body-query.json looks like:
{
  "query": "SELECT visitorId,visitNumber,visitId,visitStartTime,date,hits.time,hits.hour,hits.minute,hits.referer,hits.page.pagePath FROM ga_sessions_intraday_20180104 OMIT RECORD IF COUNT(hits.hour) < 2 LIMIT 1",
  "kind": "bigquery#queryRequest",
  "maxResults": 2,
  "defaultDataset": {
    "projectId": "google-analytics",
    "datasetId": "112233445"
  },
  "dryRun": false,
  "useLegacySql": true
}
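Since the intraday table name changes every day, the request body can be generated instead of hand-edited before each call. A minimal sketch, assuming the date suffix is the only thing that varies (the `make_query_body` helper is hypothetical):

```shell
#!/bin/sh
# Generate the queryRequest body for a given ga_sessions_intraday date
# suffix. Field names match the request body shown above.
make_query_body() {
  suffix="$1"   # e.g. 20180104
  cat <<EOF
{
  "query": "select * from ga_sessions_intraday_${suffix}",
  "kind": "bigquery#queryRequest",
  "maxResults": 2,
  "defaultDataset": {
    "projectId": "google-analytics",
    "datasetId": "112233445"
  },
  "dryRun": false,
  "useLegacySql": true
}
EOF
}

# Write the body for one day's table; pass it to curl with -d @request-body-query.json
make_query_body 20180104 > request-body-query.json
```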
Response: if we don't set the timeout parameter, the query probably does not return data yet, but it does return a jobId to use with the getQueryResults method.
{
  "kind": "bigquery#queryResponse",
  "jobReference": {
    "projectId": "google-analytics",
    "jobId": "job_RvsOhR4xxWHk43eKShWcqm5a0srh"
  },
  "jobComplete": false
}
Get the status of a specific job
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/jobs/job_35axds8rFh5v6x_ztzwCNIxvIR9- \
-H "Authorization: Bearer $ACCESS_TOKEN"
One of the output fields in the response body is status.state; it can be PENDING, RUNNING, or DONE.
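Because the state moves from PENDING through RUNNING to DONE, a client typically polls jobs.get until the job is DONE. Here is a sketch of that loop, with the real HTTP call replaced by a stub: `fetch_job_state` is hypothetical, and in real use it would curl the jobs.get endpoint with the Bearer token and parse status.state out of the JSON.

```shell
#!/bin/sh
# Poll until status.state is DONE. fetch_job_state is a stub simulating
# GET .../projects/google-analytics/jobs/$JOB_ID; it pretends the job
# finishes on the third poll.
i=0
fetch_job_state() {
  i=$((i + 1))
  if [ "$i" -lt 3 ]; then STATE=RUNNING; else STATE=DONE; fi
}

STATE=PENDING
until [ "$STATE" = "DONE" ]; do
  fetch_job_state
  echo "status.state=$STATE"
  # sleep 5   # a real client would wait between polls
done
```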
The response also contains the destination table in BigQuery where the result data has been inserted:
...
"destinationTable": {
  "projectId": "google-analytics",
  "datasetId": "_9dbcf0c4fbdc437de6df08d8d7d353cd7888b31c",
  "tableId": "anonf7bb7b98737fb464a850db7bf8d5a0e3eacdc6de"
}
...
curl -X GET "https://www.googleapis.com/bigquery/v2/projects/google-analytics/queries/job_RvsOhR4xxWHk43eKShWcqm5a0srh?timeoutMs=60000" \
-H "Authorization: Bearer $ACCESS_TOKEN"
The response looks like a standard getQueryResults JSON body.
To extract the data for a specific jobId, we must use Google Cloud Storage.
Request: with the file request-body-insert.json
curl -X POST https://www.googleapis.com/bigquery/v2/projects/google-analytics/jobs \
-d @/home/angelrojo/development/googlecloud-bigquery/src/main/resources/request-body-insert.json \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ACCESS_TOKEN"
request-body-insert.json looks like:
{
  "jobReference": {
    "projectId": "google-analytics",
    "jobId": "custom-jobId-csv-20170104_1200"
  },
  "configuration": {
    "extract": {
      "sourceTable": {
        "projectId": "google-analytics",
        "datasetId": "_9dbcf0c4fbdc437de6df08d8d7d353cd7888b31c",
        "tableId": "anonf7bb7b98737fb464a850db7bf8d5a0e3eacdc6de"
      },
      "destinationUris": ["gs://google-analytics/extraction-0001-ga_sessions_intraday_20180104.csv"],
      "destinationFormat": "CSV",
      "compression": "NONE"
    }
  }
}
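Because the extract job's jobId and destination gs:// URI must be unique per run, the insert body is best generated with a stamped name rather than copied. A sketch, assuming a `make_extract_body` helper of our own invention; the datasetId/tableId arguments are the ones read from the earlier destinationTable:

```shell
#!/bin/sh
# Generate the jobs.insert extract body with a unique jobId and CSV name.
make_extract_body() {
  stamp="$1"      # e.g. 20180104_1200, makes jobId and csv name unique
  dataset="$2"    # anonymous datasetId from destinationTable
  table="$3"      # anonymous tableId from destinationTable
  cat <<EOF
{
  "jobReference": {
    "projectId": "google-analytics",
    "jobId": "custom-jobId-csv-${stamp}"
  },
  "configuration": {
    "extract": {
      "sourceTable": {
        "projectId": "google-analytics",
        "datasetId": "${dataset}",
        "tableId": "${table}"
      },
      "destinationUris": ["gs://google-analytics/extraction-${stamp}.csv"],
      "destinationFormat": "CSV",
      "compression": "NONE"
    }
  }
}
EOF
}

# Example with the ids from the earlier response
make_extract_body 20180104_1200 _9dbcf0c4fbdc437de6df08d8d7d353cd7888b31c anonf7bb7b98737fb464a850db7bf8d5a0e3eacdc6de
```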
List all jobs in the project:
curl -X GET https://www.googleapis.com/bigquery/v2/projects/google-analytics/jobs \
-H "Authorization: Bearer $ACCESS_TOKEN"
- Export the GOOGLE_APPLICATION_CREDENTIALS variable containing the path to the credentials JSON
- Obtain the ACCESS_TOKEN
- Call query and collect the jobId -> job_35axds8rFh5v6x_ztzwCNIxvIR9-
- Call get for the job status and collect the datasetId and tableId fields of the destinationTable object from the response body
- Call insert to perform the extract, passing the custom jobId plus the previous datasetId and tableId. This writes a file in the requested format, CSV in this case, to a GCS (Google Cloud Storage) bucket. The CSV filename must be parameterized so that it is different for each call to BigQuery
- Call the Google Cloud Storage REST API