
sa360-webquery-bigquery's Introduction


SA360 WebQuery to BigQuery Exporter

Background

Large SA360 customers want to build custom reports by combining their first-party data with paid-search data.

The easiest way to achieve this is by combining the data in BigQuery. There are two ways to programmatically import SA360 data into BigQuery:

  1. API
  2. Web Query

WebQuery makes reporting easier than the API (with fewer steps), as it allows additional entity data (e.g. business data) to be included directly in the report, which keeps the report simple; with the API, this stitching has to be done in the user's own program.

How does it work?

The tool uses service account credentials to extract reports from SA360 and to send data into BigQuery. First the system extracts the report (in XML format) from SA360 and converts it into CSV on the fly (using SAX parsing); this file is then staged (copied) to GCS, followed by a call to the BigQuery API to load the data into a separate table.
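
To make the flow concrete, here is a minimal, hypothetical sketch of the staging and load steps using the Cloud Storage and BigQuery client libraries (com.google.cloud:google-cloud-storage, com.google.cloud:google-cloud-bigquery). The bucket, dataset, table, and class names are illustrative, not the tool's actual identifiers, and the XML-to-CSV conversion is assumed to have already produced a local file:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FormatOptions;
    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.LoadJobConfiguration;
    import com.google.cloud.bigquery.TableId;
    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class StageAndLoadSketch {
      public static void main(String[] args) throws Exception {
        // CSV already converted from the SA360 XML report (conversion not shown).
        Path csvFile = Paths.get("/tmp/report.csv");

        // Stage (copy) the CSV to a GCS bucket.
        Storage storage = StorageOptions.getDefaultInstance().getService();
        BlobId blobId = BlobId.of("my-staging-bucket", "reports/report.csv");
        storage.create(BlobInfo.newBuilder(blobId).build(), Files.readAllBytes(csvFile));

        // Ask BigQuery to load the staged file into a table.
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId table = TableId.of("sa360_reports", "webquery_report");
        LoadJobConfiguration load =
            LoadJobConfiguration.newBuilder(table, "gs://my-staging-bucket/reports/report.csv")
                .setFormatOptions(FormatOptions.csv())
                .setAutodetect(true) // infer the schema from the CSV header
                .build();
        Job job = bigquery.create(JobInfo.of(load));
        job.waitFor(); // block until the load job finishes
      }
    }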

Support for User account credentials has been dropped due to security issues arising from storing user credentials locally.

Video Tutorial

Part-1

Usage

Create Google Cloud Project

  1. Create a Google Cloud Project and enable billing.
  2. Enable APIs.
  3. Create a Service Account to be used for running this project.
  4. Grant the service account permissions (see the example IAM bindings after this list)
    1. SA360 permission to retrieve WebQuery reports
    2. BigQuery and Storage permissions
  5. Set Variables
    export PROJECT_ID="<google-cloud-project-id>"
    export GCS_BUCKET_NAME="<name-of-cloud-storage-bucket>"
    export BIGQUERY_DATASET_NAME="<name-of-dataset>"
    export SERVICE_ACCOUNT_NAME="<service-account-name>"
    export COMPUTE_ENGINE_INSTANCE_NAME="<compute-instance-name>"    
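
For the BigQuery and Storage permissions in step 4, one option is to add project-level IAM bindings; this is a sketch with illustrative roles, runnable only after the service account is created under Create Resources below. The SA360 permission itself is granted by adding the service account's email address as a user in the SA360 UI.

    # grant table-write, load-job, and GCS staging permissions (roles are illustrative)
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
      --member="serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
      --role="roles/bigquery.dataEditor"
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
      --member="serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
      --role="roles/bigquery.jobUser"
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
      --member="serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
      --role="roles/storage.objectAdmin"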

Create Resources

  1. Create Cloud Storage bucket to stage the reports.

    gsutil mb -p ${PROJECT_ID} "gs://${GCS_BUCKET_NAME}" 
  2. Create a BigQuery dataset to store the reports

    bq mk --project_id="${PROJECT_ID}" ${BIGQUERY_DATASET_NAME}
  3. Create Service account

    gcloud iam service-accounts create ${SERVICE_ACCOUNT_NAME} \
    --description="Service account to run SA360 webquery to BigQuery"    
  4. Create a Compute Engine instance (set your default zone and region first)

    gcloud compute instances create ${COMPUTE_ENGINE_INSTANCE_NAME} \
    --service-account="${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
    --scopes=https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/doubleclicksearch \
    --image-project debian-cloud \
    --image-family debian-10

    Ensure that the user/service account has at least READ permission for SA360 and EDITOR permission for BigQuery.

Compile and run

  1. Create a configuration file (CSV) with the specified headers (see sample-config.csv as a reference).

  2. Compile and package source code into an executable JAR.

    ./gradlew clean shadowJar
  3. Copy the fat JAR to the Compute Engine instance.

    gcloud compute scp build/libs/dswqtobq-1.1-all.jar ${COMPUTE_ENGINE_INSTANCE_NAME}:~/
    gcloud compute scp <configuration-csv-file> ${COMPUTE_ENGINE_INSTANCE_NAME}:~/        
  4. SSH into the Compute Engine instance

    gcloud compute ssh ${COMPUTE_ENGINE_INSTANCE_NAME}

    Install Java 11 on the VM if required: sudo apt install -y openjdk-11-jdk

  5. Run the jar file

    # run the JAR file, specifying the configuration file as the first parameter
    java -jar dswqtobq-1.1-all.jar <location of configuration CSV file> <local-report-temp-folder>

    After this step, you can schedule it to run automatically using cron.
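
    For example, a hypothetical crontab entry (the schedule and all paths are placeholders):

    # run the transfer daily at 06:00
    0 6 * * * cd ${HOME} && java -jar dswqtobq-1.1-all.jar config.csv /tmp/reports >> transfer.log 2>&1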

CSV File Format

The CSV configuration file must contain the following headers; the order does not matter.

  • gcsBucketName - The GCS bucket used to stage the CSV file for the BigQuery upload.
  • projectId - The GCP project ID to use for billing as well as for the BQ table location.
  • datasetId - The BQ dataset ID/name belonging to the given projectId.
  • tableId - The prefix to be used for the BigQuery table.
  • webQueryUrl - The SearchAds 360 WebQuery link.
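
For illustration, a hypothetical one-row configuration using these headers (all values are placeholders; the report URL is elided) could look like:

    gcsBucketName,projectId,datasetId,tableId,webQueryUrl
    my-staging-bucket,my-project,sa360_reports,webquery_report,https://searchads.google.com/ds/reports/download?...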

sa360-webquery-bigquery's People

Contributors

anantdamle, kentburke


sa360-webquery-bigquery's Issues

BUILD FAILURE- The goal you specified requires a project to execute but there is no POM in this directory

Hi,

I'm following the YouTube guide and got stuck at 8m:55s in the video after getting an error when compiling the code:

mvn clean compile assembly:single

Info/Error:

[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.090 s
[INFO] Finished at: 2022-04-23T16:30:31Z
[INFO] ------------------------------------------------------------------------
[ERROR] The goal you specified requires a project to execute but there is no POM in this directory (/home/tom_sharon). Please verify you invoked Maven from the correct directory. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MissingProjectException

I'm assuming it's because there's an issue with the directory. Here is my code (lines 52-54 in the TransferRunner.java file):

var webQuery = new WebQuery(xferConfig.getWebQueryUrl("this is where I pasted the web query URL"), credential);
// local file to store the report as CSV
var tempCsvFile = createCsvFile("C:\Users\toms\Documents\");

The code is a bit different, and the resources folder with the client_secrets.json file isn't on GitHub, so I created them and copied from the tutorial. I created a Desktop-type credential (couldn't find "other") with the SA360 API to generate the client ID and secret.

Also, I haven't set up the sample-config.csv file, as I didn't think it was required for the build to be successful?

Thanks,
Tom

Authorization Error

Hi,
I have followed the step-by-step instructions in the video, but unfortunately I am getting the error below. Can you please help me with this?

Exception in thread "main" java.lang.NullPointerException
at dswebquerytobigquery.Authorizer.refreshAndStoreCredential(Authorizer.java:58)
at dswebquerytobigquery.Authorizer.authorize(Authorizer.java:51)
at dswebquerytobigquery.Main.main(Main.java:45)

Thanks for your help in advance!

Authorization failed when rerun after some time (hours)

I am getting this error when I rerun the script a few hours after the first successful attempt. I deleted the StoredCredential file and ran it again; it prompted me to log in again, and everything was fine. However, in order to schedule it to run automatically every day, it needs to be able to store the credential and reuse it in the future. Could I have done something wrong here?

Much appreciated in advance.

Jun 12, 2020 5:25:36 PM dswebquerytobigquery.Main main
INFO: config file: sample-config-4.csv
Jun 12, 2020 5:25:36 PM dswebquerytobigquery.Main main
INFO: access_token: ya29.a0AfH6SMDEzSxMd3X-jiHY4r2OZSraeGoRXtnECFK_i8BrMDDwwXjwImFQA2p_rDRf7RtUIxZWUDJiIgyktvtfjtBsW-dhrLb9djE-2ngooKmC9oiCALd7cr-aaw-3YIJFUmdgIq9Kw3Wq97x2Bq1KIyYVXvV_O-kAyPM
Jun 12, 2020 5:25:36 PM dswebquerytobigquery.TransferRunner run
INFO: Processing: TransferConfig{bigQueryConfig=BigQueryConfig{projectId=marketing-database-276519, datasetId=sa360_reports, tableId=webquery_report_test_4}, tempGcsBucketName=wq-to-bq, webQueryUrl=https://searchads.google.com/ds/reports/download?ay=20700000001032621&av=21700000001165309&rid=318110&of=webqueryphtml&authuser=0}
Jun 12, 2020 5:25:36 PM dswebquerytobigquery.TransferRunner run
INFO: [Report 318110] starting: url: https://searchads.google.com/ds/reports/download?ay=20700000001032621&av=21700000001165309&rid=318110&of=webqueryphtml&authuser=0
Jun 12, 2020 5:25:36 PM dswebquerytobigquery.TransferRunner run
INFO: [Report 318110] localFile: /tmp/dswq_9286139848404184152.csv
Jun 12, 2020 5:25:37 PM dswebquerytobigquery.TransferRunner run
SEVERE: [Report 318110] Error Processing
java.io.IOException: Server returned HTTP response code: 401 for URL: https://searchads.google.com/ds/reports/download?ay=20700000001032621&av=21700000001165309&rid=318110&of=webqueryphtml&authuser=0
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1919)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1515)
at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:250)
at dswebquerytobigquery.WebQuery$WebQueryReader.getStream(WebQuery.java:95)
at dswebquerytobigquery.WebQuery$WebQueryReader.writeAsCsv(WebQuery.java:102)
at dswebquerytobigquery.TransferRunner.run(TransferRunner.java:62)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)

Server returned HTTP response code: 403

I used your code but got the following error:

SEVERE: [Report 366300] Error Processing
java.io.IOException: Server returned HTTP response code: 403 for URL: https://
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1924)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)
at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:250)
at dswebquerytobigquery.WebQuery$WebQueryReader.getStream(WebQuery.java:100)
at dswebquerytobigquery.WebQuery$WebQueryReader.writeAsCsv(WebQuery.java:107)
at dswebquerytobigquery.TransferRunner.run(TransferRunner.java:64)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)

Server returning 401 despite having access

  • I have
    • agency admin access to SA360
    • owner access to the project

This is the account I authenticated the Cloud SDK with on the VM (built from Cloud Shell)

  • The service account is attached to the VM and has
    • advertiser access to SA360
    • editor access to bq & gcs

Is there a missing step for configuring an SSH key for the service account?

SEVERE: [Report 352726] Error Processing
java.io.IOException: Server returned HTTP response code: 401 for URL: https://searchads.google.com/ds/reports/download?ay=2...&av=2...&rid=3...&of=webqueryphtml&authuser=0
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1924)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)
        at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:250)
        at dswebquerytobigquery.WebQuery$WebQueryReader.getStream(WebQuery.java:97)
        at dswebquerytobigquery.WebQuery$WebQueryReader.writeAsCsv(WebQuery.java:104)
        at dswebquerytobigquery.TransferRunner.run(TransferRunner.java:64)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

Insecure Permissions for /datastore/StoredCredential File

The sa360-webquery-bigquery Java program requests authorization from the end user and caches their OAuth2 access token and refresh token in the StoredCredential file.
Permissions for this file are: -rw-r--r-- (644)
This set of file permissions allows the owner of the file to read and write. Additionally, it allows any other user on the system to read the file.

A file on a local system which stores credentials (even short-term credentials) is a target for malware on the local system. Access should be restricted to only those users and groups requiring it. Malware known as 'TeamTNT' is known to target endpoints for locally cached cloud-infrastructure credentials.

Due to the sensitive nature of the StoredCredential file, the file permissions should be changed to -rw-------. This would restrict read and write operations to the owner of the file.
Permissions for the StoredCredential file should match those of the ~/.config/gcloud/access_tokens.db file.
The access_tokens.db file holds the cached access tokens and refresh tokens utilized by gcloud. The file permission for the access_tokens.db sqlite file is: -rw-------
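
For example, assuming the datastore/StoredCredential path described above:

    chmod 600 datastore/StoredCredential   # -rw-------: owner read/write only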

XOXO,
_kat

Support Service Account credentials

Users may want to run the tool using a scheduler (cron) on GCE or other such servers; storing user credentials on such devices is risky and error-prone.

Unable to process large file

It seems there is a limit on the number of rows this tool can process. During my testing, web queries that exceed 500k rows failed most of the time, but all of our SKU-based reporting is more than a million rows per day. I set up a VM with 96 vCPUs and 360 GB of RAM, and it was sometimes able to process almost 1 million rows, but it still failed most of the time.

When it failed, this is what I got in BigQuery: an empty table.

