
DNAE - DoubleClick Network Analysis Enabler

A data integration framework built on top of Google Marketing Platform (fka DoubleClick) APIs and Google Cloud Platform.

OVERVIEW

DNAE implements an ETL-like framework that can extract data from the Google Marketing Platform (formerly DoubleClick) platforms (DBM, DCM, DS), transform it as necessary and load the transformed data into Google Cloud Storage and BigQuery. Thanks to the built-in BigQuery connector, Google Data Studio can be used as a visualization tool. The framework is modular and can implement multiple "services" providing different kinds of ETL flows and data insights.

Please note that this is not an officially supported Google product.

INITIAL SETUP OF A DNAE PROJECT

Note: the following steps illustrate how to set up your DNAE project using the included setup scripts. Feel free to customize your setup by installing the necessary files manually.

  • Download all the files in this folder to a local machine

  • From the command line, install the needed external libraries:

    pip install -r requirements.txt
  • Set up your Google Cloud project:

    • Make sure you have a Google Cloud project ready to use, or create one (please refer to the Google Cloud documentation for any additional information on how to create a Project).

    • Install the Google Cloud SDK if you haven’t already.

    • Check that your gcloud “application default credentials” correspond to the Google Account of the Google Cloud project you want to create/use:

      gcloud auth application-default login
  • Run the script to set up the DNAE project, and follow the instructions:

    python dna_project_setup.py
    • Check that you have write access to the files before running the script

    • The interactive script will let you select the Google Cloud Platform project, give you instructions on which APIs to enable, guide you through the setup of the needed credentials, and update the template files with the IDs and access details of your specific implementation

    • DNAE uses v2 of the Task Queue REST API (now called the Cloud Tasks API). This version is currently in "alpha" and you might need to have your account whitelisted in order to use the API (check the Cloud documentation for the latest status).

    • If something goes wrong, you can run the following script to restore the files from the previous backup:

      python dna_restore_backup.py
  • DNAE (minus your specific service) is ready! Deploy the files to App Engine:

    ./deploy.sh
    • To check that everything is OK, go to the Google Cloud Platform console > App Engine > Versions: you should see your “v1” app correctly serving
    • You should also see a few “Cron Jobs” in App Engine > Cron jobs, and they should run successfully if you start them manually
    • Last but not least, you should see your source files and scripts in the corresponding Cloud Storage buckets
  • You will now need to build your own service to add to the DNAE framework.

    • You can start by running the following command to create the service folder and the main files you need (starting from the template files in “services/service-template”):

      python dna_service_setup.py
    • Have a look at the sample service in folder “services/service-example” to see how you can interact with the GMP/DoubleClick APIs through the connectors included in DNAE, how to get configuration data from an external spreadsheet, how to process the data before pushing it to BigQuery, and so on.

  • This setup creates 5 default cron jobs (in App Engine > Cron Jobs), all handled through methods in lib/core/dna_gae_handlers.py:

    • /core/cron/compute, the actual "task manager" job: it checks the Cloud Tasks queue for new tasks and starts a new Compute Engine VM instance for each one
    • /core/cron/check/bqjobs, which updates any Datastore entry carrying bqstatus, bqjob and bqerror fields with the latest status of the corresponding BigQuery job
    • /core/cron/cleanup/compute, which deletes the Compute Engine VM instances that have completed their job
    • /core/cron/cleanup/datastore, which removes Datastore entities (typically every night, before a new run of the whole process)
    • /core/cron/cleanup/storage, which deletes files from Cloud Storage older than a predefined number of days. In particular, if you want to use this cleanup job you need to create a new entity in Datastore with kind name DNACleanUpGCS and the following properties (case sensitive); a sketch of creating this entity programmatically follows the list:
      • bucket
        • type: String
        • value: my-cloud-storage-bucket
      • lbw
        • type: Integer
        • value: an integer representing a number of days for your Look Back Window (i.e. the number of days after which report files are removed from GCS)
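
    As a reference, here is a minimal sketch of creating this entity programmatically, assuming the google-cloud-datastore client library and application-default credentials (you can equally create the entity by hand in the Cloud Console):

      from google.cloud import datastore

      client = datastore.Client()  # uses your application-default credentials

      # Kind and property names must match what the cleanup job expects (case sensitive)
      entity = datastore.Entity(key=client.key('DNACleanUpGCS'))
      entity.update({
          'bucket': 'my-cloud-storage-bucket',  # the GCS bucket to clean up
          'lbw': 30,  # look-back window: remove files older than 30 days
      })
      client.put(entity)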

More info about a DNAE service

A typical DNAE-based service folder will include:

  • a settings file, e.g. service_example_settings.py (see the files in the service-example folder as a reference)

    • Remember to:
      • Create the GCS buckets you’re referencing in your settings (like project-name-service-example)
      • Create the BQ datasets you’re referencing in your settings (e.g. service_example_tables), using “unspecified” as the location
    • This is also where you’ll define the report/query parameter structures (JSON objects following the different API specifications, e.g. DBM’s query resource, which can be tested via the API Explorer), BigQuery data schemas and other service-specific structures; see the settings sketch after this list
  • The main script, service_example_run.py, with the actual steps to import and manage the data. This is obviously the most customizable (and complex!) part, where you’ll need to implement the import from the different data sources, the upload to the GCS bucket(s), any necessary transformation of the data, and the upload to the BigQuery dataset.

  • A test file for your service, e.g. service_example_test.py, which launches locally the same tasks that would otherwise be handled via the cron jobs and Cloud Tasks on GCP. In particular, you’ll gather all relevant inputs/parameters (possibly a reduced set to make the test quicker) and then call the main function from service_example_run.py with those parameters, just as you will do when creating the handler function for GAE calls (see below). A sketch of such a test also follows this list.

  • The initial shell script file (e.g. service-example-run.sh), which calls the main Python script with two parameters, queue name and task ID, matching the arguments expected by the main method of service_example_run.py

  • The file handling the requests coming from the AppEngine cron jobs, e.g. service_example_gae_handlers.py:

    • You need at least one main handler (e.g. ServiceExampleLauncher), which is referenced in appengine_main.py as the method handling the calls to your service (calls coming from the scheduled cron jobs in cron.yaml). For each iteration (e.g. for each row of your configuration sheet), this handler sets up a new task_params object with all the needed parameters, such as service name, region, reference to the initial shell script (service-example-run.sh), bucket, dataset... anything needed by your service! This object is packaged into a payload, and the new task is added to the Cloud Tasks queue (a sketch follows this list):

      gcp.gct_createtask(queue_name, payload)
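
As a reference for the settings file, here is a minimal sketch of the kind of structures it typically declares. All values are purely illustrative: the query fields follow the DBM query resource and the schema follows the BigQuery format, but check the API references and the service-example files for the authoritative shapes.

    # Buckets and datasets referenced by the service (create them beforehand, see above)
    SERVICE_EXAMPLE_GCS_BUCKET = 'project-name-service-example'
    SERVICE_EXAMPLE_BQ_DATASET = 'service_example_tables'

    # Report/query params as a JSON-like object following the DBM query resource
    DBM_QUERY_TEMPLATE = {
        'kind': 'doubleclickbidmanager#query',
        'metadata': {
            'title': 'DNAE example query',
            'dataRange': 'LAST_7_DAYS',
            'format': 'CSV',
        },
        'params': {
            'type': 'TYPE_GENERAL',
            'groupBys': ['FILTER_ADVERTISER', 'FILTER_DATE'],
            'metrics': ['METRIC_IMPRESSIONS', 'METRIC_CLICKS'],
        },
    }

    # BigQuery schema for the destination table
    EXAMPLE_BQ_SCHEMA = [
        {'name': 'advertiser_id', 'type': 'STRING'},
        {'name': 'date', 'type': 'DATE'},
        {'name': 'impressions', 'type': 'INTEGER'},
        {'name': 'clicks', 'type': 'INTEGER'},
    ]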
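
Similarly, a minimal sketch of the test file's core, assuming the main method of service_example_run.py takes a queue name and a task ID as described above (the import path and argument values are illustrative):

    from services.service_example import service_example_run

    if __name__ == '__main__':
        # Run locally with a reduced parameter set so the test is quick
        service_example_run.main('service-example-queue', 'test-task-001')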
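
Finally, a minimal sketch of what a launcher handler's core loop might look like. Everything except gcp.gct_createtask (the DNAE connector call shown above) is illustrative: the task_params fields and the config_rows iterable stand in for whatever your service actually needs.

    import json

    def launch_tasks(gcp, config_rows, queue_name='service-example-queue'):
        """Enqueue one Cloud Tasks task per configuration row (illustrative sketch)."""
        for row in config_rows:  # e.g. one iteration per configuration-sheet row
            task_params = {
                'service': 'service-example',
                'region': row.get('region', 'us-central1'),
                'script': 'service-example-run.sh',  # the initial shell script
                'bucket': 'project-name-service-example',
                'dataset': 'service_example_tables',
            }
            payload = json.dumps(task_params)  # package the params into a payload
            gcp.gct_createtask(queue_name, payload)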

DNAE libraries and folders

The standard DNAE setup has:

  • A lib folder, which has

    • a connectors folder, which includes all the libraries to “wrap” API functionalities for DBM, DCM, DS, Google Cloud Platform and Google Sheets
    • a core folder, which includes the main files which will be copied into the Compute Engine virtual machine executing each task
    • a utils folder, including different utility libraries
  • A services folder, with one subfolder per service running in the project, containing the corresponding files.
