
gcp-dataproc_serverless-running-notebooks

Objective

Orchestrator to run Notebooks on Dataproc Serverless via Cloud Composer

What is Dataproc Serverless?

Dataproc Serverless lets you run Spark batch workloads without requiring you to provision and manage your own cluster. You can specify workload parameters, and then submit the workload to the Dataproc Serverless service. The service will run the workload on a managed compute infrastructure, autoscaling resources as needed. Dataproc Serverless charges apply only to the time when the workload is executing.
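
As a concrete illustration (not code from this repo), here is a minimal sketch of submitting a PySpark batch with the google-cloud-dataproc Python client; the project, region, bucket, and file names below are placeholders:

    # Minimal sketch, assuming placeholder project/region/bucket values.
    from google.cloud import dataproc_v1

    region = "us-central1"
    client = dataproc_v1.BatchControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    batch = dataproc_v1.Batch(
        pyspark_batch=dataproc_v1.PySparkBatch(
            main_python_file_uri="gs://your-bucket/path/to/your_workload.py"
        )
    )

    # create_batch returns a long-running operation; result() waits for the
    # serverless batch to finish.
    operation = client.create_batch(
        parent=f"projects/your-project/locations/{region}",
        batch=batch,
        batch_id="example-batch-0001",
    )
    operation.result()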

With Dataproc Serverless, you can run notebooks and streamline the end-to-end data science workflow without provisioning a cluster, and you can run Spark batch workloads without provisioning and managing clusters and servers. This improves developer productivity and decreases infrastructure costs. A Fortune 10 retailer uses Dataproc Serverless to optimize retail assortment space for 500M+ items.

For more details, see the Dataproc Serverless documentation.

File Directory Structure

dataproc_serverless
├── composer_input
│   ├── jobs/
│   │   └── vehicle_analytics_executor.py
│   └── DAGs/
│       └── serverless_airflow.py
└── notebooks
    ├── datasets/
    │   └── electric_vehicle_population.csv
    └── jupyter/
        ├── vehicle_analytics.ipynb
        ├── output/
        │   └── vehicle_analytics_sample_output.ipynb
        ├── dependencies/
        │   └── requirements.txt (optional)
        └── refs/
            └── reference.py (optional)

File Details

composer_input

  • vehicle_analytics_executor.py: runs a papermill execution of the input notebook and writes the output file to the assigned location (see the papermill sketch below)
  • serverless_airflow.py: orchestrates the workflow
    • Dataproc Batch Creation: this operator runs your workloads on Dataproc Serverless (see the fuller sketch below)
      create_batch1 = DataprocCreateBatchOperator()
      create_batch2 = DataprocCreateBatchOperator()...
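
A minimal sketch of what the papermill execution inside vehicle_analytics_executor.py can look like; the command-line argument convention here is an assumption, not the repo's exact code:

    import sys
    import papermill as pm

    # Assumed convention: argv[1] is the input notebook, argv[2] the output path.
    input_notebook, output_notebook = sys.argv[1], sys.argv[2]

    # Execute every cell of the input notebook and write the executed copy,
    # including cell outputs, to the assigned (e.g. GCS) location.
    pm.execute_notebook(input_notebook, output_notebook)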
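
And a fuller, hedged sketch of one DataprocCreateBatchOperator task; the project, region, bucket, and args values are placeholders rather than the repo's actual configuration:

    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateBatchOperator,
    )

    # Defined inside a DAG context in serverless_airflow.py.
    create_batch1 = DataprocCreateBatchOperator(
        task_id="create_batch1",
        project_id="your-project",
        region="us-central1",
        batch_id="vehicle-analytics-1",
        batch={
            "pyspark_batch": {
                "main_python_file_uri": "gs://your-bucket/dataproc_serverless/composer_input/jobs/vehicle_analytics_executor.py",
                # Assumed: input and output notebook paths handed to the executor.
                "args": [
                    "gs://your-bucket/dataproc_serverless/notebooks/jupyter/vehicle_analytics.ipynb",
                    "gs://your-bucket/dataproc_serverless/notebooks/jupyter/output/vehicle_analytics_output.ipynb",
                ],
            },
        },
    )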
      

notebooks

  • vehicle_analytics.ipynb: this file creates a Spark session for the Electric Vehicle Population dataset, loads a CSV file from GCS, and performs fundamental Spark operations. It also installs the Python packages listed in requirements.txt and runs a sample Python reference file (optional); a sketch of these optional steps follows this list

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Spark Session for Electric Vehicle Population") \
        .getOrCreate()

    # Read the dataset from GCS, using the first row as column headers.
    path = f"gs://{gcs_bucket}/dataproc_serverless/notebooks/datasets/electric_vehicle_population.csv"
    df = spark.read.csv(path, header=True)
    df.show()

    # Basic transformations: filter on model year, then sort by electric range.
    df.filter(df['Model Year'] > 2020).show(5)
    df.orderBy('Electric Range').show(10)  # show() returns None, so don't assign it
    
  • jupyter/output: this is where papermill notebook execution outputs are stored; see vehicle_analytics_sample_output.ipynb for a sample output

  • electric_vehicle_population.csv: this dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through the Washington State Department of Licensing (DOL).

  • requirements.txt: Python packages to be installed in the notebook (optional)

  • reference.py: Python reference file to be invoked from the notebook (optional)
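
A minimal sketch of the optional dependency and reference-file steps inside the notebook; the local paths assume the files have already been staged from GCS:

    import subprocess

    # Install the optional packages listed in requirements.txt.
    subprocess.run(["pip", "install", "-r", "requirements.txt"], check=True)

    # Invoke the optional reference script as a separate process.
    subprocess.run(["python", "reference.py"], check=True)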

I) Orchestrating an End-to-End Notebook Execution Workflow in Cloud Composer

  1. Make sure to modify the GCS paths for the datasets in the notebook and make any additional changes.

  2. Create a Cloud Composer Environment

  3. Find the DAGs folder of the Composer environment (shown in the Cloud Composer console) and add serverless_airflow.py (the DAG file) to it to trigger DAG execution.

  4. Upload all the remaining files to the GCS bucket; only the DAG file goes into your Composer DAGs folder.

  5. Create Variables from the Airflow UI for the following (the DAG can read them with airflow.models.Variable, as sketched after this list):

    • gcp_project - the Google Cloud project to use for the Dataproc Serverless workloads

    • gce_zone - the Google Compute Engine zone to use for the workloads

    • gcs_bucket - Google Cloud Storage bucket where all the files are stored

    • phs_region - Persistent History Server region, e.g. us-central1 (optional)

    • phs - Persistent History Server name (optional)

  6. Open the Airflow UI to monitor DAG executions, runs, and logs.
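
A minimal sketch of how the DAG can read the variables from step 5; the variable names match the list above, and the defaults are placeholders:

    from airflow.models import Variable

    gcp_project = Variable.get("gcp_project")
    gce_zone = Variable.get("gce_zone")
    gcs_bucket = Variable.get("gcs_bucket")
    # Optional Persistent History Server settings, with fallbacks if unset.
    phs_region = Variable.get("phs_region", default_var="us-central1")
    phs = Variable.get("phs", default_var=None)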

Closing Note

If you're adapting this example for your own use, consider the following:

  • Setting appropriate input paths within your environment (GCS bucket, DAGs folder, etc.)
  • Setting more appropriate configurations (DAGs, Dataproc Serverless batches, Persistent History Server, etc.)

