
Orchestra

Orchestra is not an official Google Product

Overview

Composer is Google Cloud's managed version of Apache Airflow, an open source project for managing ETL workflows. We use it for this solution because you can deploy your code to production simply by moving files to Google Cloud Storage. It also provides monitoring and logging, and software installation, updates and bug fixes for Airflow are fully managed.

It is recommended that you install this solution through the Google Cloud Platform UI.

We recommend familiarising yourself with Composer here.

Orchestra is an open source project, built on top of Composer, that provides custom Airflow operators designed to solve the needs of advertisers.

Orchestra lets enterprise clients build an advertising data lake out of the box and customize it to their needs.

Orchestra lets sophisticated clients automate workflows at scale for significant efficiency gains.

Orchestra is a fully open source solution toolkit for building enterprise data solutions on Airflow.

Setting up your Orchestra environment in GCP

Billing

Composer and BigQuery - two of the main Google Cloud Platform tools Orchestra is based on - require a GCP project with a valid billing account.

See Google Cloud Billing for more information.

APIs

In your GCP Project menu (or directly through this link), access the API Library and enable the following APIs:

  • Cloud Composer
  • Cloud Dataproc
  • Cloud Storage APIs
  • BigQuery

Create a Composer environment

Follow these steps to create a Composer environment in Google Cloud Platform - please note that this can take 20-30 minutes.

Environment Variables, Tags and Configuration Properties (airflow.cfg) can all be left at their defaults, and you can use the default values for the number of nodes, machine types and disk size (you can use a smaller disk size if you want to save on costs).

Service Accounts

Setting up a service account

Google Cloud uses service accounts to automate tasks between services. This includes other Google services such as DV360 and CM.

You can see full documentation for Service Accounts here:

https://cloud.google.com/iam/docs/service-accounts

Default Service Account

By default you will see in the IAM section of your Project a default service account for Composer ("Cloud Composer Service Agent") and a default service account for Compute Engine ("Compute Engine default service account") - with their respective email addresses.

These service accounts have access to all Cloud APIs enabled for your project, making them a good fit for Orchestra. We recommend using the Compute Engine service account ("Compute Engine default service account") in particular as the main "Orchestra" service account, because it is the one used by the individual Compute Engine virtual machines that will run your tasks.

If you wish to use another account, you will have to give it access to BigQuery and full permissions for the Storage APIs.

Creating a new user for your service account in DV360

Your Service Account will need to be set up as a DV360 user so that it can access the required data from your DV360 account.

You need to have partner-level access to your DV360 account to be able to add a new user; follow the simple steps to create a new user in DV360, using this configuration:

  • Give this user the email address of the service account you wish to use.
  • Select all the advertisers you want it to be able to access.
  • Give it Read & Write permissions.
  • Save!

Configuring Orchestra

You have now set up the Composer environment in GCP and granted the proper permissions to its default Service Account.
You're ready to configure Orchestra!

Variables

The Orchestra project will require several variables to run.

These can be set via the Admin section in the Airflow UI (accessible from the list of Composer Environments, clicking on the corresponding link under "Airflow Web server").


Each variable is listed below with the area it belongs to, the value to set and the workflows that need it:

  • gce_zone (Cloud Project; needed for: All) - your Google Compute Engine zone (you can find it under "Location" in the list of Composer Environments).
  • gcs_bucket (Cloud Project; needed for: All) - the Cloud Storage bucket for your Airflow DAGs (you can find a link to the bucket in the Environments page - see Image1).
  • cloud_project_id (Cloud Project; needed for: All) - the Project ID you can find in your GCP console homepage.
  • erf_bq_dataset (BigQuery; needed for: ERFs) - the name of the BigQuery dataset you wish to use - see Image2 and the documentation here.
  • partner_ids (DV360; needed for: All) - the comma-separated list of DV360 partner IDs used for Entity Read Files.
  • private_entity_types (DV360; needed for: ERFs) - a comma-separated list of Private Entity Read Files you would like to import.
  • sequential_erf_dag_name (DV360; needed for: ERFs) - the name of your DAG as it will show up in the UI. Name it whatever makes sense for you (alphanumeric characters, dashes, dots and underscores only).
  • dv360_sdf_advertisers (DV360; needed for: SDFs) - a dictionary of partners (keys) and advertisers (values) which will be used to download SDFs. Initially you can set the value to {"partner_id": ["advertiser_id1", "advertiser_id2"]} and use the dv360_get_sdf_advertisers_from_report_dag DAG to update it programmatically.
  • dv360_sdf_advertisers_report_id (DV360; needed for: SDFs, Reports) - the DV360 report ID which will be used to get a list of all active partners and advertisers. Initially you can set the value to 1 and use the dv360_create_sdf_advertisers_report_dag DAG to update it programmatically.
  • number_of_advertisers_per_sdf_api_call (DV360; needed for: SDFs) - the number of advertiser IDs included in each call to the DV360 API to retrieve SDFs. Set the value to 1.
  • sdf_api_version (DV360; needed for: SDFs) - the SDF version (column names, types, order) in which the entities will be returned. Set the value to 4.2 (no other versions are currently supported).
  • sdf_bq_dataset (BigQuery; needed for: SDFs) - the name of the BigQuery dataset you wish to use to store SDFs.
  • sdf_file_types (BigQuery; needed for: SDFs) - a comma-separated list of SDF types that will be returned (e.g. LINE_ITEM, AD_GROUP). Currently, this solution supports LINE_ITEM, AD_GROUP, AD, INSERTION_ORDER and CAMPAIGN.

Image1: (screenshots of the Composer Environments page, showing the link to the DAGs bucket)

Image2: (screenshot of the BigQuery dataset)
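Rather than typing each variable into the Airflow UI by hand, you can also prepare them as a JSON file and import it from the Admin > Variables page (if your Airflow version supports variable import). The short Python sketch below is only an illustration, not part of Orchestra; every ID, name and value in it is a placeholder you need to replace with your own.

import json

# Placeholder values only -- replace every entry with your own project, partner
# and dataset details before importing the file.
orchestra_variables = {
    "gce_zone": "us-central1-a",
    "gcs_bucket": "your-composer-dags-bucket",
    "cloud_project_id": "your-gcp-project-id",
    "erf_bq_dataset": "your_erf_dataset",
    "partner_ids": "1234,5678",
    "private_entity_types": "Campaign,LineItem",
    "sequential_erf_dag_name": "dv360_sequential_erf",
    # Dictionaries are stored as JSON strings inside an Airflow variable.
    "dv360_sdf_advertisers": json.dumps({"1234": ["111111", "222222"]}),
    "dv360_sdf_advertisers_report_id": "1",
    "number_of_advertisers_per_sdf_api_call": "1",
    "sdf_api_version": "4.2",
    "sdf_bq_dataset": "your_sdf_dataset",
    "sdf_file_types": "LINE_ITEM,AD_GROUP",
}

# Writes a file you can upload from the Admin > Variables page.
with open("orchestra_variables.json", "w") as f:
    json.dump(orchestra_variables, f, indent=2)

Dictionary-valued variables such as dv360_sdf_advertisers are stored as JSON strings, which is why the sketch serialises that entry with json.dumps.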

Adding Workflows

As with any other Airflow deployment, you will need DAG files describing your workflows in order to schedule and run your tasks, plus hooks, operators and other libraries to help build those tasks.

You can find the core files for Orchestra in our GitHub repository: clone the repo (or download the files directly).

You can then design the DAGs you wish to run and add them to the dags folder; a minimal test DAG is sketched below.
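If you want a quick way to check that your environment picks up new files before deploying the Orchestra workflows, a throwaway DAG like the following sketch will do. It is not part of Orchestra, and the dag_id and schedule are arbitrary placeholders.

# A throwaway smoke-test DAG (not part of Orchestra) used only to confirm that
# files dropped in the dags folder are parsed and scheduled by Composer.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="orchestra_smoke_test",  # placeholder name, pick your own
    default_args=default_args,
    schedule_interval="@daily",
    catchup=False,
) as dag:
    start = DummyOperator(task_id="start")
    done = DummyOperator(task_id="done")
    start >> done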

Upload all the DAGs and other required files to the DAGs Storage Folder that you can access from the Airflow UI.
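If you prefer to script the upload instead of using the Cloud Console, the sketch below uses the google-cloud-storage Python client; the bucket, project and file names are placeholders, and the bucket should be the same one stored in the gcs_bucket variable.

# Sketch: copy a DAG file into the Composer environment bucket.
# Requires the google-cloud-storage package and application default credentials.
from google.cloud import storage

client = storage.Client(project="your-gcp-project-id")  # placeholder project ID
bucket = client.bucket("your-composer-dags-bucket")     # same bucket as gcs_bucket

# Composer picks up DAG files from the dags/ prefix of its bucket.
blob = bucket.blob("dags/my_orchestra_dag.py")
blob.upload_from_filename("my_orchestra_dag.py")
print("Uploaded gs://{}/{}".format(bucket.name, blob.name))

The same copy can also be done from the command line with gsutil, or by dragging the files into the bucket in the Cloud Console.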


This will automatically generate the DAGs and schedule them to run (you will be able to see them in the Airflow UI).

From now on, you can use (the Composer-managed instance of) Airflow as you normally would, including all the available functionality for scheduling, troubleshooting and so on.

Additional info

Deleting an environment

Full details can be found here. Please note that files created by Composer are not automatically deleted; you will need to remove them manually or they will continue to incur storage charges. The same applies to the BigQuery datasets.
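If you prefer to clean up programmatically, the sketch below removes a leftover bucket and the BigQuery datasets with the standard Google Cloud Python clients. Every name in it is a placeholder, so double-check them before running anything destructive.

# Sketch: delete leftover Orchestra artifacts (bucket contents and datasets).
# All names are placeholders -- verify them first, deletions cannot be undone.
from google.cloud import bigquery, storage

project_id = "your-gcp-project-id"

# Remove the old Composer environment bucket; force=True also deletes the
# objects it contains (only works for buckets with a small number of objects).
gcs = storage.Client(project=project_id)
gcs.bucket("your-composer-dags-bucket").delete(force=True)

# Remove the ERF and SDF datasets, including all tables inside them.
bq = bigquery.Client(project=project_id)
bq.delete_dataset("your_erf_dataset", delete_contents=True, not_found_ok=True)
bq.delete_dataset("your_sdf_dataset", delete_contents=True, not_found_ok=True)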

Data & Privacy

Orchestra is a framework that allows powerful API access to your data. Liability for how you use that data is your own. It is important that all data you keep is secure and that you have legal permission to work with and transfer all data you use. Orchestra can operate across multiple Partners; please be sure that this access is covered by legal agreements with your clients before implementing Orchestra. This project is covered by the Apache License.

orchestra's People

Contributors

avaz1301, brettdidonatogoog, ceoloide, efolgar, jdban, jimbol4, kingsleykelly, oczos, pgilmon, rebeccasg


orchestra's Issues

Migrate GoogleCloudStorageToFTPOperator to core operators

Hello

My team created new operators for Airflow that copy files from and to GCS and FTPS:
apache/airflow#6393
apache/airflow#6366

These operators can replace the operator that has been implemented by your team:
https://github.com/google/orchestra/blob/Orchestra-2.0/orchestra/google/cloud/operators/gcp_gcs_operator.py
I think it makes no sense to duplicate operators, and it is better to use the operator that will be provided with the new versions of Airflow/Cloud Composer.

Best regards,

CC: @TobKed

Google Analytics storage_name_object not templated

When trying to use the new google_analytics operators, the documentation shows that the parameter storage_name_object is templated; however, when I pass the following:

import_audience = GoogleAnalyticsDataImportUploadOperator(
    task_id='import_audience_list_{}'.format(destination_info['account_name']),
    storage_bucket=walden_config['result_bucket_name'],
    storage_name_object="audience-{{ds_nodash}}.csv",
    account_id=destination_info['account_id'],
    web_property_id=destination_info['web_property_id'],
    custom_data_source_id=destination_info['custom_data_source_id'],
    mime_type='application/octet-stream',
    api_version='v3',
    api_name='analytics',
    gcp_conn_id='google_cloud_default',
    data_import_filename='audience-{{ds_nodash}}.csv',
    dag=dag
)

The dag fails with the following error:

HttpError 404 when requesting https://storage.googleapis.com/storage/v1/b/ <my_bucket> /o/audience-%7B%7Bds_nodash%7D%7D.csv?alt=media returned "Not Found">

I've removed my bucket name here, but it is correct in the URL.

It looks like it's not recognizing the templated {{ds_nodash}} here.

We are running v1.10.2-composer
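For context only (not a confirmed fix for this issue): Airflow renders Jinja templates exclusively for the parameters an operator lists in its template_fields attribute, and only when the task actually executes. The snippet below is an illustrative custom operator, not Orchestra code, showing how a parameter is opted into templating.

# Illustration of Airflow's templating mechanism, not Orchestra code.
# Only parameters named in template_fields are rendered by Jinja, at execution time.
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class ExampleTemplatedOperator(BaseOperator):
    template_fields = ("storage_name_object",)  # opt this argument into rendering

    @apply_defaults
    def __init__(self, storage_name_object, *args, **kwargs):
        super(ExampleTemplatedOperator, self).__init__(*args, **kwargs)
        self.storage_name_object = storage_name_object

    def execute(self, context):
        # By now "audience-{{ds_nodash}}.csv" would have been rendered to a real date.
        self.log.info("Object name: %s", self.storage_name_object)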

"Requested Resource Too Large to Return" - GoogleDisplayVideo360

Hi,

I set up a process which uses the GoogleDisplayVideo360SDFToBigQueryOperator from display_video_360. However, for some advertisers that contain a lot of data I got this error:

ERROR - <HttpError 500 when requesting https://www.googleapis.com/doubleclickbidmanager/v1/sdf/download?alt=json returned "Requested Resource Too Large to Return">

Is there any way I can reduce the amount of data from this SDF API call?

Thanks!
