sunkarisanthosh / initialization-actions

This project forked from googleclouddataproc/initialization-actions

Scripts that run on all nodes of your cluster before the cluster starts, letting you customize your cluster

Home Page: https://cloud.google.com/dataproc/init-actions

License: Apache License 2.0

Shell 66.16% Python 30.22% R 0.02% Dockerfile 0.13% Starlark 3.46%

Cloud Dataproc Initialization Actions

When creating a Dataproc cluster, you can specify initialization actions, in the form of executables or scripts, that Dataproc runs on all nodes of the cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without installing dependencies at run time.

How initialization actions are used

Initialization actions must be stored in a Cloud Storage bucket and can be passed as a parameter to the gcloud command or the clusters.create API when creating a Dataproc cluster. For example, to specify an initialization action when creating a cluster with the gcloud command, you can run:

gcloud dataproc clusters create <CLUSTER_NAME> \
    [--initialization-actions [GCS_URI,...]] \
    [--initialization-action-timeout TIMEOUT]
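
As the synopsis above indicates, --initialization-actions takes a comma-separated list of Cloud Storage URIs. The following is a minimal sketch of assembling that list in a script. The helper name join_uris, the bucket and script names, and the 10m timeout are placeholders chosen for illustration, not values from this repository. The command is printed rather than executed so the sketch can be tried without a Google Cloud project.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join all arguments with commas, the form
# that --initialization-actions expects for multiple URIs.
# The function body runs in a subshell so the IFS change does not leak.
join_uris() (
  IFS=','
  echo "$*"
)

# Placeholder values for illustration only.
CLUSTER="example-cluster"
ACTIONS=$(join_uris "gs://example-bucket/first.sh" "gs://example-bucket/second.sh")

# Dry run: print the command instead of running it.
echo gcloud dataproc clusters create "${CLUSTER}" \
    --initialization-actions "${ACTIONS}" \
    --initialization-action-timeout 10m
```

Removing the leading echo would run the command for real, assuming gcloud is installed and configured.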

During development, you can create a Dataproc cluster using Dataproc-provided regional initialization actions buckets (for example goog-dataproc-initialization-actions-us-east1):

REGION=<region>
CLUSTER=<cluster_name>
gcloud dataproc clusters create ${CLUSTER} \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/presto/presto.sh

⚠️ NOTICE: For production use, it is strongly recommended that you copy initialization actions to your own Cloud Storage bucket before creating clusters. This guarantees that all Dataproc cluster nodes use the same initialization action code and prevents unintended upgrades from upstream:

BUCKET=<your_init_actions_bucket>
REGION=<region>
CLUSTER=<cluster_name>
gsutil cp presto/presto.sh gs://${BUCKET}/
gcloud dataproc clusters create ${CLUSTER} \
    --region ${REGION} \
    --initialization-actions gs://${BUCKET}/presto.sh

You can decide when to sync your copy of the initialization action with any changes to the initialization action that occur in the GitHub repository. Doing this is also useful if you want to modify initialization actions to meet your needs.
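
One possible way to make each sync explicit is to embed a content hash in the destination path, so that a cluster always references one exact version of the script. This is only a sketch of that idea, not something this repository prescribes. The helper name pinned_uri and the init-actions/<digest>/ layout are assumptions, and the sketch relies on sha256sum from GNU coreutils being available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a destination URI that embeds the first
# 12 hex characters of the script's SHA-256 digest.
# The "init-actions/<digest>/" layout is an assumed convention.
pinned_uri() {
  local bucket="$1" script="$2" digest
  digest=$(sha256sum "${script}" | cut -c1-12)
  echo "gs://${bucket}/init-actions/${digest}/$(basename "${script}")"
}

# Usage, commented out because it needs gsutil, gcloud, a bucket,
# and a local checkout of the repository:
# DEST=$(pinned_uri "${BUCKET}" presto/presto.sh)
# gsutil cp presto/presto.sh "${DEST}"
# gcloud dataproc clusters create "${CLUSTER}" \
#     --region "${REGION}" \
#     --initialization-actions "${DEST}"
```

With this layout, a changed script lands at a new path, so existing cluster definitions keep pointing at the copy they were tested with.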

Why these samples are provided

These samples show how various packages and components can be installed on Dataproc clusters. You should understand how these samples work before running them on your clusters. The initialization actions in this repository are provided without support, and you use them at your own risk.

Actions provided

This repository currently offers the following actions for use with Dataproc clusters.

Removed actions

Previously, this repo provided init actions for the following, which have since been removed because equivalent functionality is now provided directly by Dataproc:

Initialization actions on single node clusters

Single Node clusters have dataproc-role set to Master and dataproc-worker-count set to 0. Most of the initialization actions in this repository should work out of the box because they run only on the master. Examples include notebooks, such as Apache Zeppelin, and libraries, such as Apache Tez. Actions that run on all nodes of the cluster, such as cloud-sql-proxy, also work out of the box.

Some initialization actions are known not to work on Single Node clusters. All of these expect to have daemons on multiple nodes.

  • Apache Drill
  • Apache Flink
  • Apache Kafka
  • Apache Zookeeper

Feel free to send pull requests or file issues if you have a good use case for running one of these actions on a Single Node cluster.
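
An initialization action that starts daemons on multiple nodes could check for the Single Node case up front, using the two metadata values described above. The following is a minimal sketch. The function name is_single_node is hypothetical, and the values are passed as arguments so the check can be tried outside a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the role is Master and the
# worker count is 0, which is how Single Node clusters are described above.
is_single_node() {
  [ "$1" = 'Master' ] && [ "$2" = '0' ]
}

# On a cluster node the values would come from the metadata helper:
# ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
# WORKERS=$(/usr/share/google/get_metadata_value attributes/dataproc-worker-count)
ROLE='Master'   # placeholder value for illustration
WORKERS='0'     # placeholder value for illustration

if is_single_node "${ROLE}" "${WORKERS}"; then
  echo "Single Node cluster detected: skipping multi-node daemon setup"
fi
```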

Using cluster metadata

Dataproc sets special metadata values for the instances that run in your cluster. You can use these values to customize the behavior of initialization actions, for example:

# Read this node's role from the instance metadata.
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  # ... master-specific actions ...
  :
else
  # ... worker-specific actions ...
  :
fi
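
As scripts grow, the role check is often easier to maintain when each branch lives in its own function. The following is a sketch of one such layout. The function names configure_master, configure_worker, and run_for_role are hypothetical, and the role is passed as an argument so the dispatch can be exercised outside a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical per-role steps; real actions would install or configure software here.
configure_master() { echo "configuring master"; }
configure_worker() { echo "configuring worker"; }

# Dispatch on the role value; any role other than Master is treated as a worker,
# mirroring the if/else in the example above.
run_for_role() {
  case "$1" in
    Master) configure_master ;;
    *)      configure_worker ;;
  esac
}

# On a cluster node the argument would come from the metadata helper:
# run_for_role "$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
run_for_role 'Master'   # placeholder value for illustration
```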

You can also use the --metadata flag of the gcloud dataproc clusters create command to provide your own custom metadata:

gcloud dataproc clusters create cluster-name \
    --initialization-actions ... \
    --metadata name1=value1,name2=value2,... \
    ... other flags ...
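
Inside an initialization action, custom metadata can be read with the same helper used for dataproc-role. The sketch below adds a fallback for keys that were not set. The function name metadata_or_default and the METADATA_HELPER override variable are assumptions introduced for illustration, and the fallback covers both a failing helper and an empty value because the helper's exact behavior for missing keys is not documented here.

```shell
#!/usr/bin/env bash
# Path to the metadata helper; overridable so the function can be tried off-cluster.
METADATA_HELPER="${METADATA_HELPER:-/usr/share/google/get_metadata_value}"

# Hypothetical helper: print the value of attributes/<key>, or the given
# default when the helper is missing, fails, or returns an empty value.
metadata_or_default() {
  local key="$1" default="$2" value
  if [ -x "${METADATA_HELPER}" ] \
      && value=$("${METADATA_HELPER}" "attributes/${key}") \
      && [ -n "${value}" ]; then
    echo "${value}"
  else
    echo "${default}"
  fi
}

# name1 matches the key from the --metadata example above; the default is a placeholder.
NAME1=$(metadata_or_default name1 'fallback-value')
echo "name1=${NAME1}"
```

Falling back to a default keeps the action usable on clusters created without the custom key, at the cost of hiding typos in key names.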

For more information

For more information, review the Dataproc documentation. You can also pose questions to the Stack Overflow community with the tag google-cloud-dataproc. See our other Google Cloud Platform GitHub repos for sample applications and scaffolding for other frameworks and use cases.

Mailing list

Subscribe to [email protected] for announcements and discussion.

Contributing changes

Licensing

Contributors

medb, functicons, chimerasaurus, karth295, bradmiro, szewi, bsidhom, dennishuo, cyxxy, mengdong, pmkc, dariuszaniszewski, viadea, mikaylakonst, dansedov, nehalecky, gogasca, aman-ebay, arisha84, chethanuk, jmikula, jphalip, axelmagn, aniket486, roderickyao, sameerz, animeshnandanwar, ranu010101, gaurangi94, findepi
