
aztk's Introduction

Azure Distributed Data Engineering Toolkit (AZTK)

Azure Distributed Data Engineering Toolkit (AZTK) is a Python CLI application for provisioning on-demand Spark on Docker clusters in Azure. It's a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale.

This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.

Status

This repository has been marked for archival. It is no longer maintained.

Notable Features

Setup

  1. Install aztk with pip:
    pip install aztk
  2. Initialize the project in a directory. This will automatically create a .aztk folder with config files in your working directory:
    aztk spark init
  3. Log in or register for an Azure account, navigate to Azure Cloud Shell, and run:
    wget -q https://raw.githubusercontent.com/Azure/aztk/v0.10.3/account_setup.sh -O account_setup.sh &&
    chmod 755 account_setup.sh &&
    /bin/bash account_setup.sh
  4. Follow the on-screen prompts to create the necessary Azure resources and copy the output into your .aztk/secrets.yaml file. For more information, see Getting Started Scripts. A quick way to verify your setup is shown below.
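
Once your secrets.yaml is populated, a quick sanity check is to list your clusters; on a correctly configured account the command should simply show no clusters yet rather than fail with an authentication error. (This verification step is just a suggestion, not part of the official setup.)

# should succeed and show an empty cluster table if .aztk/secrets.yaml is configured correctly
aztk spark cluster list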

Quickstart Guide

The core experience of this package is centered around a few commands.

# create your cluster
aztk spark cluster create
aztk spark cluster add-user
# monitor and manage your clusters
aztk spark cluster get
aztk spark cluster list
aztk spark cluster delete
# login and submit applications to your cluster
aztk spark cluster ssh
aztk spark cluster submit

1. Create and set up your cluster

First, create your cluster:

aztk spark cluster create --id my_cluster --size 5 --vm-size standard_d2_v2
  • See our available VM sizes here.
  • The --vm-size argument must be the official SKU name, which usually comes in the form "standard_d2_v2"
  • You can create low-priority VMs at an 80% discount by using --size-low-pri instead of --size (see the example after this list)
  • By default, AZTK runs Spark 2.2.0 on an Ubuntu 16.04 Docker image. More info here
  • By default, AZTK will create a user (with the username spark) for your cluster
  • The cluster id (--id) can only contain alphanumeric characters, hyphens, and underscores, and cannot be more than 64 characters long.
  • By default, you cannot create clusters of more than 20 cores in total. Visit this page to request a core quota increase.
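
For example, the five-node cluster above can be created on low-priority VMs instead of dedicated ones (the cluster id below is just a placeholder; low-priority nodes may be preempted by Azure):

# low-priority variant of the cluster above
aztk spark cluster create --id my_lowpri_cluster --size-low-pri 5 --vm-size standard_d2_v2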

More information regarding using a cluster can be found in the cluster documentation

2. Check on your cluster status

To check your cluster status, use the get command:

aztk spark cluster get --id my_cluster
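
While a new cluster is provisioning, you can poll this command until all nodes report a ready state, for example with the standard watch utility (the 30-second interval is just a suggestion):

# re-run the status check every 30 seconds
watch -n 30 "aztk spark cluster get --id my_cluster"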

3. Submit a Spark job

When your cluster is ready, you can submit jobs from your local machine to run against the cluster. The output of spark-submit will be streamed to your local console. Run this command from the cloned AZTK repo:

# submit a java application
aztk spark cluster submit \
    --id my_cluster \
    --name my_java_job \
    --class org.apache.spark.examples.SparkPi \
    --executor-memory 20G \
    path/to/examples.jar 1000

# submit a python application
aztk spark cluster submit \
    --id my_cluster \
    --name my_python_job \
    --executor-memory 20G \
    path/to/pi.py 1000
  • The aztk spark cluster submit command takes the same parameters as the standard spark-submit command, except instead of specifying --master, AZTK requires that you specify your cluster --id and a unique job --name
  • The job name (--name) argument must be at least 3 characters long
    • It can only contain alphanumeric characters including hyphens but excluding underscores
    • It cannot contain uppercase letters
  • Each job you submit must have a unique name
  • Use the --no-wait option for your command to return immediately (see the sketch after this list)
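
If you submit with --no-wait, the application runs in the background and you can fetch its output later. The sketch below assumes the log-retrieval subcommand is app-logs; verify the exact name and flags with aztk spark cluster --help:

# fire-and-forget submit, then retrieve the driver output later
aztk spark cluster submit --id my_cluster --name my_python_job --no-wait path/to/pi.py 1000
# assumption: the subcommand is `app-logs`; check `aztk spark cluster --help` if this differs
aztk spark cluster app-logs --id my_cluster --name my_python_job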

Learn more about the spark submit command here

4. Log in and Interact with your Spark Cluster

Most users will want to work interactively with their Spark clusters. With the aztk spark cluster ssh command, you can SSH into the cluster's master node. This command also helps you port-forward your Spark Web UI and Spark Jobs UI to your local machine:

aztk spark cluster ssh --id my_cluster --user spark

By default, we port forward the Spark Web UI to localhost:8080, Spark Jobs UI to localhost:4040, and the Spark History Server to localhost:18080.

You can configure these settings in the .aztk/ssh.yaml file.
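
A quick way to confirm the port forwarding is active is to probe the forwarded ports from a second terminal while the SSH session is open (the ports below are the defaults listed above; curl is assumed to be installed locally):

# run from another terminal while `aztk spark cluster ssh` is connected
curl -sf http://localhost:8080 > /dev/null && echo "Spark Web UI reachable"
curl -sf http://localhost:4040 > /dev/null && echo "Spark Jobs UI reachable (only while an application is running)"
curl -sf http://localhost:18080 > /dev/null && echo "Spark History Server reachable"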

NOTE: When working interactively, you may want to use tools like Jupyter or RStudio-Server. To do so, you need to set up your cluster with the appropriate docker image and plugin. See Plugins for more information.

5. Manage and Monitor your Spark Cluster

You can also see your clusters from the CLI:

aztk spark cluster list

And get the state of any specified cluster:

aztk spark cluster get --id <my_cluster_id>

Finally, you can delete any specified cluster:

aztk spark cluster delete --id <my_cluster_id>

FAQs

Next Steps

You can find more documentation here

aztk's People

Contributors

amolthacker, audac1ty, brettsimons, brnleehng, dciborow, emlyn, imcdnzl, jafreck, jiata, lachiemurray, mmduyzend, pabsel, paselem, shtratos, stevekuo4, timotheeguerin

aztk's Issues

Support for custom WASB auth

  • Update configuration.cfg to secrets.cfg
  • Configure spark core-site.xml with storage credentials
  • Update docs with WASB usage

azb spark cluster submit cannot reference files that are not in the working directory

This doesn't work (absolute path):
azb spark cluster submit ... /Users/jiata/code/thunderbolt/examples/src/main/python/pi.py 100

This also doesn't work (relative path):
azb spark cluster submit ... ../another-thunderbolt/examples/src/main/python/pi.py 100

This works (as we'd expect):
azb spark cluster submit ... /examples/src/main/python/pi.py 100

Error creating cluster having a resize error with --wait

./bin/spark-cluster-create with the --wait option throws the following error

File "./bin/spark-cluster-create", line 111, in
wait)
File "/Users/Amol/Documents/GitHub/DTDE/myenv/lib/python3.6/site-packages/dtde-0.1-py3.6.egg/dtde/clusterlib.py", line 170, in create_cluster
wait)
File "/Users/Amol/Documents/GitHub/DTDE/myenv/lib/python3.6/site-packages/dtde-0.1-py3.6.egg/dtde/util.py", line 167, in create_pool_if_not_exist
batch_models.ComputeNodeState.idle)
File "/Users/Amol/Documents/GitHub/DTDE/myenv/lib/python3.6/site-packages/dtde-0.1-py3.6.egg/dtde/util.py", line 193, in wait_for_all_nodes_state
if pool.resize_error is not None:
AttributeError: 'CloudPool' object has no attribute 'resize_error'

jupyter sample not working for spark version 1.6.3

The preloaded Jupyter sample code does not work for Spark 1.6.3. In the sample code, we call "spark", which automatically fetches the SparkSession, but SparkSessions do not exist in Spark 1.6.3! We have to use the SparkContext instead.

space required after colon in secrets.yaml file

It seems like YAML requires a space after the colon.

for example, this works:

batchaccountname: myaccount

but this doesn't:

batchaccountname:myaccount

Possibly not a bug, but is this the expected behavior? Or are we just parsing so that it requires the space?

spark-cluster-get for deleted/deleting clusters throws an error

spark-cluster-get returns the following

State: deleting
Node Size: standard_d2_v2
Nodes: 0 -> 3
| Dedicated: 0
| Low priority: 0

Nodes State IP:Port Master
Traceback (most recent call last):
File "./bin/spark-cluster-get", line 49, in
pool_id)

/dtde/clusterlib.py", line 268, in get_cluster_details
master_node = util.get_master_node_id(batch_client, pool_id)
...
azure.batch.models.batch_error.BatchErrorException: {'lang': 'en-US', 'value': 'The specified job does not exist.\nRequestId:1f02160c-0cb7-41e1-8f3f-c0f4340f1e0a\nTime:2017-07-04T19:52:22.0409662Z'}

azb cmds require sshkey

I reinstalled thunderbolt and couldn't get the commands to work until I added an sshkey to my .thunderbolt/secrets.yaml file


sshkey not working from config when using the ssh cmd

When I run azb spark cluster create --id test, by default it should create the cluster, wait, and then create the user for the cluster (with the specified SSH key).

However, this is broken today. After waiting for the cluster to spin up, I should be able to run azb spark cluster ssh --id test, and it should just use the SSH key. Instead, I get prompted to enter a password.

This makes me think that creating the cluster with the user did not work properly.

username without a ssh-key or password

My cluster.yaml is left as default (which sets username: spark)

When I run azb spark cluster create --id ignitedemo, it shows:

-------------------------------------------
spark cluster id:        ignitedemo
spark cluster size:      2
>        dedicated:      2
>     low priority:      0
spark cluster vm size:   standard_a2
path to custom script:   None
docker repo name:        jiata/thunderbolt:0.1.0-spark2.2.0-python3.5.4
wait for cluster:        True
username:                spark
-------------------------------------------
Waiting for all nodes in pool ignitedemo to reach desired state...

So this is confusing for a user who doesn't want to use an SSH key (Windows users). It's supposedly creating a username spark, but with what password?

I think it should prompt for a password if no SSH key is set, or we should not prefill the username field in the config by default.

Then I get this error, which is good - but it is only thrown after the cluster is created:

dtde.error.InvalidUserCredentialsError: Cannot add user to cluster. Need to provide a ssh public key or password.

spark-submit errors if names don't work in Azure

Spark job names are limited by the back-end and will error out if they do not meet the requirements. We should have a better/friendlier error for this, as well as update the documentation to explain this.
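
Until a friendlier error exists, the naming rules from the quickstart are the practical guide; the names below are illustrative only:

# fails the back-end naming rules: uppercase letters and underscores are not allowed
aztk spark cluster submit --id my_cluster --name My_Job path/to/pi.py 100
# acceptable: lowercase alphanumerics and hyphens, at least 3 characters
aztk spark cluster submit --id my_cluster --name my-job-001 path/to/pi.py 100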

Logger implementation

Remove all the print statements and use a logger instead, so we can support a --verbose argument.
