kubeflow / spark-operator Goto Github PK

View Code? Open in Web Editor NEW

2.6K 2.6K 1.3K 24.79 MB

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

License: Apache License 2.0

Go 95.88% Shell 2.16% Dockerfile 0.60% Makefile 0.42% Smarty 0.59% Mustache 0.36%

apache-spark google-cloud-dataproc kubernetes kubernetes-controller kubernetes-crd kubernetes-operator spark

spark-operator's Introduction

Kubeflow Spark Operator

Overview

The Kubernetes Operator for Apache Spark aims to make specifying and running Spark applications as easy and idiomatic as running other workloads on Kubernetes. It uses Kubernetes custom resources for specifying, running, and surfacing status of Spark applications. For a complete reference of the custom resource definitions, please refer to the API Definition. For details on its design, please refer to the design doc. It requires Spark 2.3 and above that supports Kubernetes as a native scheduler backend.

The Kubernetes Operator for Apache Spark currently supports the following list of features:

Supports Spark 2.3 and up.
Enables declarative application specification and management of applications through custom resources.
Automatically runs spark-submit on behalf of users for each SparkApplication eligible for submission.
Provides native cron support for running scheduled applications.
Supports customization of Spark pods beyond what Spark natively is able to do through the mutating admission webhook, e.g., mounting ConfigMaps and volumes, and setting pod affinity/anti-affinity.
Supports automatic application re-submission for updated SparkApplication objects with updated specification.
Supports automatic application restart with a configurable restart policy.
Supports automatic retries of failed submissions with optional linear back-off.
Supports mounting local Hadoop configuration as a Kubernetes ConfigMap automatically via sparkctl.
Supports automatically staging local application dependencies to Google Cloud Storage (GCS) via sparkctl.
Supports collecting and exporting application-level metrics and driver/executor metrics to Prometheus.

Project Status

Project status: beta

Current API version: v1beta2

If you are currently using the v1beta1 version of the APIs in your manifests, please update them to use the v1beta2 version by changing apiVersion: "sparkoperator.k8s.io/<version>" to apiVersion: "sparkoperator.k8s.io/v1beta2". You will also need to delete the previous version of the CustomResourceDefinitions named sparkapplications.sparkoperator.k8s.io and scheduledsparkapplications.sparkoperator.k8s.io, and replace them with the v1beta2 version either by installing the latest version of the operator or by running kubectl create -f manifest/crds.

Customization of Spark pods, e.g., mounting arbitrary volumes and setting pod affinity, is implemented using a Kubernetes Mutating Admission Webhook, which became beta in Kubernetes 1.9. The mutating admission webhook is disabled by default if you install the operator using the Helm chart. Check out the Quick Start Guide on how to enable the webhook.

Prerequisites

Version >= 1.13 of Kubernetes to use the subresource support for CustomResourceDefinitions, which became beta in 1.13 and is enabled by default in 1.13 and higher.
Version >= 1.16 of Kubernetes to use the MutatingWebhook and ValidatingWebhook of apiVersion: admissionregistration.k8s.io/v1.

Installation

The easiest way to install the Kubernetes Operator for Apache Spark is to use the Helm chart.

$ helm repo add spark-operator https://kubeflow.github.io/spark-operator

$ helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace

This will install the Kubernetes Operator for Apache Spark into the namespace spark-operator. The operator by default watches and handles SparkApplications in every namespaces. If you would like to limit the operator to watch and handle SparkApplications in a single namespace, e.g., default instead, add the following option to the helm install command:

--set sparkJobNamespace=default

For configuration options available in the Helm chart, please refer to the chart's README.

Version Matrix

The following table lists the most recent few versions of the operator.

Operator Version	API Version	Kubernetes Version	Base Spark Version	Operator Image Tag
`latest` (master HEAD)	`v1beta2`	1.13+	`3.0.0`	`latest`
`v1beta2-1.3.3-3.1.1`	`v1beta2`	1.16+	`3.1.1`	`v1beta2-1.3.3-3.1.1`
`v1beta2-1.3.2-3.1.1`	`v1beta2`	1.16+	`3.1.1`	`v1beta2-1.3.2-3.1.1`
`v1beta2-1.3.0-3.1.1`	`v1beta2`	1.16+	`3.1.1`	`v1beta2-1.3.0-3.1.1`
`v1beta2-1.2.3-3.1.1`	`v1beta2`	1.13+	`3.1.1`	`v1beta2-1.2.3-3.1.1`
`v1beta2-1.2.0-3.0.0`	`v1beta2`	1.13+	`3.0.0`	`v1beta2-1.2.0-3.0.0`
`v1beta2-1.1.2-2.4.5`	`v1beta2`	1.13+	`2.4.5`	`v1beta2-1.1.2-2.4.5`
`v1beta2-1.0.1-2.4.4`	`v1beta2`	1.13+	`2.4.4`	`v1beta2-1.0.1-2.4.4`
`v1beta2-1.0.0-2.4.4`	`v1beta2`	1.13+	`2.4.4`	`v1beta2-1.0.0-2.4.4`
`v1beta1-0.9.0`	`v1beta1`	1.13+	`2.4.0`	`v2.4.0-v1beta1-0.9.0`

When installing using the Helm chart, you can choose to use a specific image tag instead of the default one, using the following option:

--set image.tag=<operator image tag>

Get Started

Get started quickly with the Kubernetes Operator for Apache Spark using the Quick Start Guide.

If you are running the Kubernetes Operator for Apache Spark on Google Kubernetes Engine and want to use Google Cloud Storage (GCS) and/or BigQuery for reading/writing data, also refer to the GCP guide.

For more information, check the Design, API Specification and detailed User Guide.

Contributing

Please check CONTRIBUTING.md and the Developer Guide out.

Community

Join our Kubeflow Slack Channel
Check out who is using the Kubernetes Operator for Apache Spark.

spark-operator's People

Contributors

Stargazers

Watchers

Forkers

liyinan926 amitkumarj441 eversmily foxish zak-hassan resouer elgalu holdenk sven0726 palak2011 aland-zhang datalayer-externals paul-rogers ash211 lyft pablogore-zz vglafirov cerndb jkremser wenhaochen dylangrafmyre bjg2 lenadroid scotthew1 tengpeng dharmeshkakadia tmckayus crobby juv simisoz xieydd wjliu lightbend u2takey mortent kumare3 dailong deveshk0 yunleiz maduhu mucahitkantepe huanwei mapr varuns-zetaris sarjeet2013 pujithaz krisss85 jxie418 puneetloya samsonov90 jiaxuanzhou skonto anandswaminathan tizhou86 rohit-menon nurland ezc maasg grzegorzlyczba amukhopad skalinets tiagoooliveira roksolana-d ringtail xiaoxubeii mr-robot-in acmore idole 40660367 fit2cloudrd tedli shouhong andrewgdavis danferrante74 mohanraj1311 s-pedamallu staugust dyhpoon rogervaas ynwssjx razius praves77 yubobo gdcollect chenqin wangmiao1981 harrisonfeng damozhiying xaviercallens jessieamorris luca-tiozzo-ict washkinazy vsaliy agemooij barkulu bhanditz dongjoon-hyun dalavancloud askulkarni2 edilmo

spark-operator's Issues

Update to support using the default single Spark image

It's been changed upstream to combine the images for the driver, executor, and init-container into a single image (apache/spark#20192). The Spark operator needs to be updated to support that.

Allow for NodePort to be configurable / switched to LoadBalancer

Ref: liyinan926/spark-operator#1.

@sylus, I moved the issue over to this new repository where future development will be done.

Add cron support

Implementation (#125).
Unit tests.
Documentation.

Add a Helm chart for the operator deployment

Add support for staging local dependencies to a HTTP server (e.g., RSS)

And before the RSS is upstreamed, an intermediate alternative is to support staging local dependencies to a simple HTTP-based file server running inside a K8s cluster, without security baked in for the downloading part that is in-cluster communication. The simple file server probably should support HTTPs for the uploading part that is often from outside the cluster.

Replacing the initializer with a mutating webhook

code references k8s.io/spark-on-k8s-operator which does not resolve to an accessible repo

i'm trying to make use of the generated client code, but can't even fetch the dependency because the client refers to a repo that doesn't exist...

import (
	sparkClientset "github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/client/clientset/versioned"
	sparkV1aplpha1 "github.com/GoogleCloudPlatform/spark-on-k8s-operator/pkg/apis/sparkoperator.k8s.io/v1alpha1"
)

$ go get
# cd .; git clone https://github.com/kubernetes/spark-on-k8s-operator /Users/scott/go/src/k8s.io/spark-on-k8s-operator
Cloning into '/Users/scott/go/src/k8s.io/spark-on-k8s-operator'...
ERROR: Repository not found.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
package k8s.io/spark-on-k8s-operator/pkg/apis/sparkoperator.k8s.io: exit status 128
package k8s.io/spark-on-k8s-operator/pkg/client/clientset/versioned/typed/sparkoperator.k8s.io/v1alpha1: cannot find package "k8s.io/spark-on-k8s-operator/pkg/client/clientset/versioned/typed/sparkoperator.k8s.io/v1alpha1" in any of:
	/usr/local/Cellar/go/1.10/libexec/src/k8s.io/spark-on-k8s-operator/pkg/client/clientset/versioned/typed/sparkoperator.k8s.io/v1alpha1 (from $GOROOT)
	/Users/scott/go/src/k8s.io/spark-on-k8s-operator/pkg/client/clientset/versioned/typed/sparkoperator.k8s.io/v1alpha1 (from $GOPATH)

Dockerfile source code of gcr.io/ynli-k8s/spark

Hi, thanks for this project.

https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/4f74796bae8dad06fea9d27d3337fda8e69850c4/Dockerfile#L17

Could you point me to the source code of this base docker image?

Add support for dumping application logs to GCS

Integration with job history server with GCS as the backend storage

Provide a reference template for the job history server Deployment.
Add documentation.

Add restartPolicy field

We have a lot of Streaming applications that need to be restarted if they ever die.
Currently, the controller considers an application finished if the driver pod dies for any reason.
We would like to be able to specify restartPolicy: Always or restartPolicy: OnFailure to get behavior similar to Kubernetes Deployments and Jobs, respectively, ie have the controller re-submit a completed or failed application.

Support non-container-local application dependencies

Currently only container local dependencies or remote dependencies that can be downloaded are supported. This means users need either bake their dependencies into custom images or manually upload them to, e.g., a Google Cloud Storage bucket.

Basic e2e testing

@foxish

separate successfulJobsHistoryLimit and failedJobsHistoryLimit

Currently pastRunNames doesn't differentiate between successful and failed runs. Separating them like k8s cronjobs (failedJobsHistoryLimit and successfulJobsHistoryLimit ) allow retaining failed jobs for later debugging.

duplicated drivers when used restful api and RestartPolicy=Never

When i used restful api to create a SparkApplication for the spark pi.I got duplicate drivers and service.
I saw the issue #155 , and i pull the new master and rebuild it,but it was always happening.

my k8s cluster has one master and two workers.

my restful request:

curl --basic -u username:passwd -k -l -H "Content-type: application/json" -X POST -d '{"apiVersion":"sparkoperator.k8s.io/v1alpha1","kind":"SparkApplication","metadata":{"name":"spark-pi001","namespace":"default"},"spec":{"driver":{"memory":"512m","coreLimit":"200m","cores":0.1,"labels":{"version":"2.3.0"},"serviceAccount":"spark","volumeMounts":[{"mountPath":"/SkyDiscovery/cephfs/user/username","name":"test-volume"}]},"executor":{"cores":1,"instances":2,"labels":{"version":"2.3.0"},"memory":"512m","volumeMounts":[{"mountPath":"/SkyDiscovery/cephfs/user/username","name":"test-volume"}]},"image":"spark:v2.3.0_skydata","mainApplicationFile":"local:///SkyDiscovery/cephfs/user/username/spark-examples_2.11-2.3.0.jar","arguments":[],"mainClass":"org.apache.spark.examples.SparkPi","mode":"cluster","restartPolicy":"Never","type":"Scala","volumes":[{"name":"test-volume","hostPath":{"path":"/SkyDiscovery/cephfs/user/username"}}]}}' https://*****:6443/apis/sparkoperator.k8s.io/v1alpha1/namespaces/default/sparkapplications

duplicate drivers：

spark-pi001-090fd4fdefb131e08b4142efea55eca9-driver 0/1 ContainerCreating 0 20s
spark-pi001-af2e021c6dcc37f3b404093f37da3fe8-driver 0/1 ContainerCreating 0 19s

duplicate service：

spark-pi001-2249495700-ui-svc
spark-pi001-2284845769-ui-svc
spark-pi001-090fd4fdefb131e08b4142efea55eca9-driver-svc
spark-pi001-af2e021c6dcc37f3b404093f37da3fe8-driver-svc

when i check the operator log, spark-pi001 was updated after it was added,then spark-submit was executed twice.

I0528 06:20:17.754806 1 controller.go:169] SparkApplication spark-pi001 was added, enqueueing it for submission
I0528 06:20:17.760060 1 sparkui.go:75] Creating a service spark-pi001-2284845769-ui-svc for the Spark UI for application spark-pi001
I0528 06:20:17.838537 1 submission_runner.go:81] spark-submit arguments: [/opt/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master k8s://https://10.96.0.1:443 --deploy-mode cluster --conf spark.kubernetes.namespace=default --conf spark.app.name=spark-pi001 --conf spark.kubernetes.container.image=192.168.61.32:9980/skydiscovery/spark:v2.3.0_skydata --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=spark-pi001 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-id=spark-pi001-2284845769 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.driver.cores=0.100000 --conf spark.kubernetes.driver.limit.cores=200m --conf spark.driver.memory=512m --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.driver.label.version=2.3.0 --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/volumemounts.test-volume=Cgt0ZXN0LXZvbHVtZRAAGiEvU2t5RGlzY292ZXJ5L2NlcGhmcy91c2VyL3NqbHRlc3QiAA== --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/volumes.test-volume=Cgt0ZXN0LXZvbHVtZRIlCiMKIS9Ta3lEaXNjb3ZlcnkvY2VwaGZzL3VzZXIvc2psdGVzdA== --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=spark-pi001 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-id=spark-pi001-2284845769 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.executor.instances=2 --conf spark.executor.cores=1 --conf spark.executor.memory=512m --conf spark.kubernetes.executor.label.version=2.3.0 --conf spark.kubernetes.executor.annotation.sparkoperator.k8s.io/volumemounts.test-volume=Cgt0ZXN0LXZvbHVtZRAAGiEvU2t5RGlzY292ZXJ5L2NlcGhmcy91c2VyL3NqbHRlc3QiAA== --conf spark.kubernetes.executor.annotation.sparkoperator.k8s.io/volumes.test-volume=Cgt0ZXN0LXZvbHVtZRIlCiMKIS9Ta3lEaXNjb3ZlcnkvY2VwaGZzL3VzZXIvc2psdGVzdA== --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/ownerreference=ChBTcGFya0FwcGxpY2F0aW9uGgtzcGFyay1waTAwMSIkMzBlM2ViMzctNjIzZi0xMWU4LWE2NzItZmExNjNlNWE4ZDg3Kh1zcGFya29wZXJhdG9yLms4cy5pby92MWFscGhhMQ== local:///SkyDiscovery/cephfs/user/username/spark-examples_2.11-2.3.0.jar]
I0528 06:20:18.338777 1 controller.go:218] SparkApplication spark-pi001 was updated, enqueueing it for submission
I0528 06:20:18.344042 1 sparkui.go:75] Creating a service spark-pi001-2249495700-ui-svc for the Spark UI for application spark-pi001
I0528 06:20:18.433535 1 submission_runner.go:81] spark-submit arguments: [/opt/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master k8s://https://10.96.0.1:443 --deploy-mode cluster --conf spark.kubernetes.namespace=default --conf spark.app.name=spark-pi001 --conf spark.kubernetes.container.image=192.168.61.32:9980/skydiscovery/spark:v2.3.0_skydata --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=spark-pi001 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-id=spark-pi001-2249495700 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.driver.cores=0.100000 --conf spark.kubernetes.driver.limit.cores=200m --conf spark.driver.memory=512m --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.driver.label.version=2.3.0 --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/volumemounts.test-volume=Cgt0ZXN0LXZvbHVtZRAAGiEvU2t5RGlzY292ZXJ5L2NlcGhmcy91c2VyL3NqbHRlc3QiAA== --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/volumes.test-volume=Cgt0ZXN0LXZvbHVtZRIlCiMKIS9Ta3lEaXNjb3ZlcnkvY2VwaGZzL3VzZXIvc2psdGVzdA== --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=spark-pi001 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-id=spark-pi001-2249495700 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.executor.instances=2 --conf spark.executor.cores=1 --conf spark.executor.memory=512m --conf spark.kubernetes.executor.label.version=2.3.0 --conf spark.kubernetes.executor.annotation.sparkoperator.k8s.io/volumemounts.test-volume=Cgt0ZXN0LXZvbHVtZRAAGiEvU2t5RGlzY292ZXJ5L2NlcGhmcy91c2VyL3NqbHRlc3QiAA== --conf spark.kubernetes.executor.annotation.sparkoperator.k8s.io/volumes.test-volume=Cgt0ZXN0LXZvbHVtZRIlCiMKIS9Ta3lEaXNjb3ZlcnkvY2VwaGZzL3VzZXIvc2psdGVzdA== --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/ownerreference=ChBTcGFya0FwcGxpY2F0aW9uGgtzcGFyay1waTAwMSIkMzBlM2ViMzctNjIzZi0xMWU4LWE2NzItZmExNjNlNWE4ZDg3Kh1zcGFya29wZXJhdG9yLms4cy5pby92MWFscGhhMQ== local:///SkyDiscovery/cephfs/user/username/spark-examples_2.11-2.3.0.jar]
I0528 06:20:20.672806 1 initializer.go:292] Processing Spark driver pod spark-pi001-090fd4fdefb131e08b4142efea55eca9-driver
I0528 06:20:20.672905 1 initializer.go:439] Removed initializer on pod spark-pi001-090fd4fdefb131e08b4142efea55eca9-driver
I0528 06:20:21.194235 1 initializer.go:292] Processing Spark driver pod spark-pi001-af2e021c6dcc37f3b404093f37da3fe8-driver
I0528 06:20:21.194350 1 initializer.go:439] Removed initializer on pod spark-pi001-af2e021c6dcc37f3b404093f37da3fe8-driver
I0528 06:20:21.238599 1 submission_runner.go:99] spark-submit completed for SparkApplication spark-pi001 in namespace default
I0528 06:20:21.749293 1 submission_runner.go:99] spark-submit completed for SparkApplication spark-pi001 in namespace default
I0528 06:21:53.275856 1 initializer.go:292] Processing Spark executor pod spark-pi001-090fd4fdefb131e08b4142efea55eca9-exec-2
I0528 06:21:53.275968 1 initializer.go:439] Removed initializer on pod spark-pi001-090fd4fdefb131e08b4142efea55eca9-exec-2
I0528 06:21:53.576619 1 initializer.go:292] Processing Spark executor pod spark-pi001-090fd4fdefb131e08b4142efea55eca9-exec-1
I0528 06:21:53.576683 1 initializer.go:439] Removed initializer on pod spark-pi001-090fd4fdefb131e08b4142efea55eca9-exec-1
I0528 06:21:58.583400 1 initializer.go:292] Processing Spark executor pod spark-pi001-af2e021c6dcc37f3b404093f37da3fe8-exec-2
I0528 06:21:58.583499 1 initializer.go:439] Removed initializer on pod spark-pi001-af2e021c6dcc37f3b404093f37da3fe8-exec-2
I0528 06:21:58.779218 1 initializer.go:292] Processing Spark executor pod spark-pi001-af2e021c6dcc37f3b404093f37da3fe8-exec-1
I0528 06:21:58.779334 1 initializer.go:439] Removed initializer on pod spark-pi001-af2e021c6dcc37f3b404093f37da3fe8-exec-1

*However, when i use "kubectl apply -f .yaml" to create sparkApplication, i got the only driver.

I am new to k8s. Is there anyway to get the only driver and service.

Support automatic RBAC setup on a per-application basis

Spark driver pods need certain permissions to be able to create and watch executor pods. The service account used by the driver pod must have the appropriate permission for the driver to be able to do its work. The Spark operator should support automatically creating a service account with the necessary permissions for the driver pods to run. Specifically, it needs to create a service account, a ClusterRole and a ClusterRoleBinding. It probably also should support automatically creating the namespace of a SparkApplication if it doesn't exist.

Investigate how updates to existing SparkApplication objects should be handled

Integration with Prometheus for metrics and monitoring

Implement sparkctl: a cmd tool for working with the operator

Spark operator roadmap for 2018

This is an issue for tracking the 2018 roadmap for the Spark Operator. The list below is tentative and may be constantly updated as ideas and thoughts pop up.

Features

Documentations

Publish user guide and API definition doc.
Publish design doc.
Document v1alpha version of the SparkApplication API.

If both hadoopConfigMap and sparkConfigMap are specified, spark operator does not create driver

I noticed that whenever both hadoopConfigMap and sparkConfigMap are specified, there is an exception raised

However, if I specify them alone, it works e.g.

apiVersion: "sparkoperator.k8s.io/v1alpha1"
kind: SparkApplication
metadata:
  name: secure-eos-events-select
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gitlab-registry.cern.ch/db/spark-service/docker-registry/spark-on-k8s:v2.3.0-xrootd-s3"
  imagePullPolicy: Always
  mainClass: org.sparkservice.sparkrootapplications.examples.EventsSelect
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-service-examples_2.11-0.0.1.jar"
  arguments:
    - "root://eosuser.cern.ch/eos/user/p/pmrowczy/Downloads/evsel.root"
  deps:
    jars:
      - http://central.maven.org/maven2/org/diana-hep/root4j/0.1.6/root4j-0.1.6.jar
      - http://central.maven.org/maven2/org/apache/bcel/bcel/5.2/bcel-5.2.jar
      - http://central.maven.org/maven2/org/diana-hep/spark-root_2.11/0.1.16/spark-root_2.11-0.1.16.jar
  #hadoopConfigMap: secure-eos-events-select-hadoop-conf #<- does not work with sparkConfigMap
  sparkConfigMap: secure-eos-events-select-spark-conf #<- does not work with hadoopConfigMap
  sparkConf:
    "spark.kubernetes.driverEnv.KRB5CCNAME": /usr/lib/hadoop/etc/hadoop/krb5cc_0
    "spark.executorEnv.KRB5CCNAME": /usr/lib/hadoop/etc/hadoop/krb5cc_0
#    "spark.kubernetes.driverEnv.KRB5CCNAME": /etc/hadoop/conf/krb5cc_0
#    "spark.executorEnv.KRB5CCNAME": /etc/hadoop/conf/krb5cc_0
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "1024m"
    labels:
      version: 2.3.0
    serviceAccount: spark
    configMaps:
      - name: secure-eos-events-select-hadoop-conf #<- does work
        path: /usr/lib/hadoop/etc/hadoop/
#    configMaps:
#      - name: secure-eos-events-select-spark-conf #<- does work
#        path: /opt/spark/conf
  executor:
    instances: 1
    cores: 2
    memory: "2048m"
    labels:
      version: 2.3.0
    configMaps:
      - name: secure-eos-events-select-hadoop-conf #<- does work
        path: /usr/lib/hadoop/etc/hadoop/
#    configMaps:
#      - name: secure-eos-events-select-spark-conf  #<- does work
#        path: /opt/spark/conf
  restartPolicy: Never

Below there is an exception raised

Name:         secure-eos-events-select
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  sparkoperator.k8s.io/v1alpha1
Kind:         SparkApplication
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-05-24T13:51:57Z
  Generation:          1
  Resource Version:    104967
  Self Link:           /apis/sparkoperator.k8s.io/v1alpha1/namespaces/default/sparkapplications/secure-eos-events-select
  UID:                 a028cf32-5f59-11e8-a690-02163e01c159
Spec:
  Arguments:
    root://eosuser.cern.ch/eos/user/p/pmrowczy/Downloads/evsel.root
  Deps:
    Jars:
      http://central.maven.org/maven2/org/diana-hep/root4j/0.1.6/root4j-0.1.6.jar
      http://central.maven.org/maven2/org/apache/bcel/bcel/5.2/bcel-5.2.jar
      http://central.maven.org/maven2/org/diana-hep/spark-root_2.11/0.1.16/spark-root_2.11-0.1.16.jar
  Driver:
    Core Limit:  1000m
    Cores:       1
    Labels:
      Version:        2.3.0
    Memory:           1024m
    Service Account:  spark
  Executor:
    Cores:      2
    Instances:  1
    Labels:
      Version:            2.3.0
    Memory:               2048m
  Hadoop Config Map:      secure-eos-events-select-hadoop-conf
  Image:                  gitlab-registry.cern.ch/db/spark-service/docker-registry/spark-on-k8s:v2.3.0-xrootd-s3
  Image Pull Policy:      Always
  Main Application File:  local:///opt/spark/examples/jars/spark-service-examples_2.11-0.0.1.jar
  Main Class:             org.sparkservice.sparkrootapplications.examples.EventsSelect
  Mode:                   cluster
  Restart Policy:         Never
  Spark Config Map:       secure-eos-events-select-spark-conf
  Type:                   Scala
Status:
  App Id:  secure-eos-events-select-4075353935
  Application State:
    Error Message:  Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  for kind: [Pod]  with name: [null]  in namespace: [default]  failed.
                    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
                    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71)
                    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:363)
                    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$3.apply(KubernetesClientApplication.scala:145)
                    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$3.apply(KubernetesClientApplication.scala:144)
                    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2585)
                    at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:144)
                    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:235)
                    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:227)
                    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2585)
                    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:227)
                    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:192)
                    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
                    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
                    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
                    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
                    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketTimeoutException: timeout
  at okio.Okio$4.newTimeoutException(Okio.java:230)
  at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
  at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
  at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
  at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
  at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
  at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
  at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
  at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
  at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
  at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
  at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
  at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
  at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
  at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
  at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
  at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
  at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
  at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
  at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
  at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:93)
  at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
  at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
  at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
  at okhttp3.RealCall.execute(RealCall.java:69)
  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:377)
  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:226)
  at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:769)
  at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:356)
  ... 14 more
Caused by: java.net.SocketException: Socket closed
  at java.net.SocketInputStream.read(SocketInputStream.java:204)
  at java.net.SocketInputStream.read(SocketInputStream.java:141)
  at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
  at sun.security.ssl.InputRecord.read(InputRecord.java:503)
  at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
  at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
  at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
  at okio.Okio$2.read(Okio.java:139)
  at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
  ... 41 more

    State:          SUBMISSION_FAILED
  Completion Time:  <nil>
  Driver Info:
    Web UI Port:          30289
    Web UI Service Name:  secure-eos-events-select-4075353935-ui-svc
  Submission Time:        2018-05-24T13:51:57Z
Events:
  Type     Reason                            Age   From            Message
  ----     ------                            ----  ----            -------
  Normal   SparkApplicationAdded             17s   spark-operator  SparkApplication secure-eos-events-select was added, enqueued it for submission
  Warning  SparkApplicationSubmissionFailed  3s    spark-operator  SparkApplication secure-eos-events-select failed submission: Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  for kind: [Pod]  with name: [null]  in namespace: [default]  failed.

Alpha release v0.1

This is to track the following work items towards the v0.1 alpha release.

@foxish

Investigate enabling cache resync and re-list upon controller restarting

Currently, cache resync is disabled for SparkApplication objects. This has a potential problem: the controller may miss updates to SparkApplication objects in cases the watch connection is lost and re-established later or the controller gets restarted during an upgrade. To prevent the controller from missing update events, it's critical to enable cache re-sync.

A related issue is how to handle re-listed SparkApplication objects upon controller restarting. The controller needs to determine if it should submit for each existing SparkApplication object.

Not able to launch a spark operator using standard docker

Here's the full stack trace,

I0117 19:27:35.500409 1 main.go:71] Checking the kube-dns add-on
I0117 19:27:35.523940 1 main.go:76] Starting the Spark operator
I0117 19:27:35.524650 1 controller.go:136] Starting the SparkApplication controller
I0117 19:27:35.524664 1 controller.go:138] Creating the CustomResourceDefinition sparkapplications.sparkoperator.k8s.io
W0117 19:27:35.539429 1 crd.go:69] CustomResourceDefinition sparkapplications.sparkoperator.k8s.io already exists
I0117 19:27:35.539489 1 controller.go:144] Starting the SparkApplication informer
I0117 19:27:35.639770 1 controller.go:151] Starting the workers of the SparkApplication controller
I0117 19:27:35.640005 1 spark_pod_monitor.go:109] Starting the Spark Pod monitor
I0117 19:27:35.640012 1 spark_pod_monitor.go:112] Starting the Pod informer of the Spark Pod monitor
I0117 19:27:35.640065 1 initializer.go:112] Starting the Spark Pod initializer
I0117 19:27:35.640073 1 initializer.go:164] Adding the InitializerConfiguration spark-pod-initializer-config
I0117 19:27:35.640194 1 submission_runner.go:58] Starting the spark-submit runner
F0117 19:27:35.645496 1 main.go:96] failed to create InitializerConfiguration spark-pod-initializer-config: the server could not find the requested resource (post initializerconfigurations.admissionregistration.k8s.io)

Investigating migrating sparkctl to be based on kubectl plugins

Improve unit test coverage

RpcOutboxMessage - Ask timeout before connecting successfully when creating ScheduledSparkApplication jobs

I am getting the RPC timeout between driver and executors. The simple sparkapplication works fine.

Here is the logs from executor pod.

	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
	at scala.util.Try$.apply(Try.scala:192)
	at scala.util.Failure.recover(Try.scala:216)
	at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
	at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
	at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
	at scala.concurrent.Promise$class.complete(Promise.scala:55)
	at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
	at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
	at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
	at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
	at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
	at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply from pyspark-example-1525728618655618581-1525728619985-driver-svc.default.svc.cluster.local:7078 in 120 seconds
	... 8 more

The simple sparkapplication works fine:

apiVersion: "sparkoperator.k8s.io/v1alpha1"
kind: SparkApplication
metadata:
  name: pyspark-job-example
  namespace: default
spec:
  type: Python
  mode: cluster
  mainApplicationFile: "local:///code/pyspark-example.py"
  Arguments:
  - "wasb://[email protected]/episodes.avro"
  deps:
    jars:
      - "local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar" 
  sparkConf:
    "spark.hadoop.fs.azure.account.key.sparkdkakadia.blob.core.windows.net": "4gIguFMiymTI4BoHL2LsbIdEVjDYUZfcgjYqAP6T+/eUXWH9RzJPQAlQpUsT1+hAAsVJdlZd9fBQ+ctZV+I55A=="
  driver:
    cores: 0.5
    image: dharmeshkakadia/spark-py-example:latest
    memory: "2048m"
    labels:
      version: 2.2.0
    serviceAccount: spark
  executor:
    cores: 2
    instances: 2
    image: dharmeshkakadia/spark-executor-py:latest
    memory: "1024m"
    labels:
      version: 2.2.0
  restartPolicy: Never

However the same spec does not work under ScheduledSparkApplication.

apiVersion: "sparkoperator.k8s.io/v1alpha1"
kind: ScheduledSparkApplication
metadata:
  name: pyspark-example
  namespace: default
spec:
  schedule: "*/10 * * * *"
  concurrencyPolicy: Replace
  runHistoryLimit: 3
  template:
    type: Python
    mode: cluster
    mainApplicationFile: "local:///code/pyspark-example.py"
    Arguments:
    - "wasb://[email protected]/episodes.avro"
    deps:
      jars:
        - "local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar" 
    sparkConf:
      "spark.hadoop.fs.azure.account.key.sparkdkakadia.blob.core.windows.net": "4gIguFMiymTI4BoHL2LsbIdEVjDYUZfcgjYqAP6T+/eUXWH9RzJPQAlQpUsT1+hAAsVJdlZd9fBQ+ctZV+I55A=="
    driver:
      cores: 0.5
      image: dharmeshkakadia/spark-py-example:latest
      memory: "2048m"
      labels:
        version: 2.2.0
      serviceAccount: spark
    executor:
      cores: 2
      instances: 2
      image: dharmeshkakadia/spark-executor-py:latest
      memory: "1024m"
      labels:
        version: 2.2.0
    restartPolicy: Never

Allow follow log option (-f) in sparkctl

Currently we repeatedly run sparkctl log command to get the new log lines. It would be good to have -f option that shows the log as its updated.

Use jasonschema to validate CRD objects

Allow sparkctl flags for create being specified inside the application yaml

Currently, flags as --upload-to, --project and --namespace are passed separately from application yaml file. Could it be an interesting addition, to allow in sparkctl create to define these parameters inside the yaml?

This way SparkApplication yaml could specify which gcp project (in case of s3 - profile) to use, and location to upload dependencies for this specific app.

RestartPolicy: Always ends up with two drivers after the first restart event

Issue Description

Installing from the user-guide and running the sample spark-pi.yaml application with the modification to make it RestartPolicy: Always causes two drivers to be created after the first run completes.

Timeline of events

Install the operator onto a 1.8.10 cluster:

$ kubectl create -f manifest/

Nothing is running:

$ kubectl get pods
No resources found.

Modify the restart policy to Always, and submit the sample application:

$ kubectl create -f examples/spark-pi.yaml 
sparkapplication "spark-pi" created

Logs of the submission:

$ kubectl logs -n sparkoperator sparkoperator-7598b5bb6-k2jm6 -f
I0518 20:59:48.813207       1 sparkui.go:74] Creating a service spark-pi-3506525398-ui-svc for the Spark UI for application spark-pi
I0518 20:59:48.837560       1 submission_runner.go:81] spark-submit arguments: [/opt/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master k8s://https://10.3.0.1:443 --deploy-mode cluster --conf spark.kubernetes.namespace=default --conf spark.app.name=spark-pi --conf spark.kubernetes.container.image=gcr.io/ynli-k8s/spark:v2.3.0 --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=spark-pi --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-id=spark-pi-3506525398 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.driver.cores=0.100000 --conf spark.kubernetes.driver.limit.cores=200m --conf spark.driver.memory=512m --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.driver.label.version=2.3.0 --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/volumemounts.test-volume=Cgt0ZXN0LXZvbHVtZRAAGgQvdG1wIgA= --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/volumes.test-volume=Cgt0ZXN0LXZvbHVtZRITChEKBC90bXASCURpcmVjdG9yeQ== --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=spark-pi --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-id=spark-pi-3506525398 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.executor.instances=1 --conf spark.executor.cores=1 --conf spark.executor.memory=512m --conf spark.kubernetes.executor.label.version=2.3.0 --conf spark.kubernetes.executor.annotation.sparkoperator.k8s.io/volumemounts.test-volume=Cgt0ZXN0LXZvbHVtZRAAGgQvdG1wIgA= --conf spark.kubernetes.executor.annotation.sparkoperator.k8s.io/volumes.test-volume=Cgt0ZXN0LXZvbHVtZRITChEKBC90bXASCURpcmVjdG9yeQ== --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/ownerreference=ChBTcGFya0FwcGxpY2F0aW9uGghzcGFyay1waSIkNjZjMzE3ZmQtNWFkZS0xMWU4LWE3ZDgtMDI0Y2Y1NjlmMTZhKh1zcGFya29wZXJhdG9yLms4cy5pby92MWFscGhhMQ== local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar]

First driver and executor running:

$ kubectl get pods
NAME                                               READY     STATUS    RESTARTS   AGE
spark-pi-f01a49537f403ab39455f44f9132e36a-driver   1/1       Running   0          50s
spark-pi-f01a49537f403ab39455f44f9132e36a-exec-1   1/1       Running   0          2s

The application completes the calculation of Pi. The controller logs:

I0518 21:00:58.948971       1 controller.go:597] SparkApplication spark-pi failed or terminated, restarting it with RestartPolicy Always
I0518 21:00:59.046457       1 sparkui.go:74] Creating a service spark-pi-1742375757-ui-svc for the Spark UI for application spark-pi
I0518 21:00:59.052603       1 controller.go:597] SparkApplication spark-pi failed or terminated, restarting it with RestartPolicy Always
I0518 21:00:59.082823       1 submission_runner.go:81] spark-submit arguments: [/opt/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master k8s://https://10.3.0.1:443 --deploy-mode cluster --conf spark.kubernetes.namespace=default --conf spark.app.name=spark-pi --conf spark.kubernetes.container.image=gcr.io/ynli-k8s/spark:v2.3.0 --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=spark-pi --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-id=spark-pi-1742375757 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.driver.cores=0.100000 --conf spark.kubernetes.driver.limit.cores=200m --conf spark.driver.memory=512m --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.driver.label.version=2.3.0 --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/volumemounts.test-volume=Cgt0ZXN0LXZvbHVtZRAAGgQvdG1wIgA= --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/volumes.test-volume=Cgt0ZXN0LXZvbHVtZRITChEKBC90bXASCURpcmVjdG9yeQ== --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=spark-pi --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-id=spark-pi-1742375757 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.executor.instances=1 --conf spark.executor.cores=1 --conf spark.executor.memory=512m --conf spark.kubernetes.executor.label.version=2.3.0 --conf spark.kubernetes.executor.annotation.sparkoperator.k8s.io/volumemounts.test-volume=Cgt0ZXN0LXZvbHVtZRAAGgQvdG1wIgA= --conf spark.kubernetes.executor.annotation.sparkoperator.k8s.io/volumes.test-volume=Cgt0ZXN0LXZvbHVtZRITChEKBC90bXASCURpcmVjdG9yeQ== --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/ownerreference=ChBTcGFya0FwcGxpY2F0aW9uGghzcGFyay1waSIkNjZjMzE3ZmQtNWFkZS0xMWU4LWE3ZDgtMDI0Y2Y1NjlmMTZhKh1zcGFya29wZXJhdG9yLms4cy5pby92MWFscGhhMQ== local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar]
E0518 21:00:59.408970       1 controller.go:612] failed to delete old driver pod spark-pi-f01a49537f403ab39455f44f9132e36a-driver of SparkApplication spark-pi: pods "spark-pi-f01a49537f403ab39455f44f9132e36a-driver" not found
I0518 21:00:59.414244       1 sparkui.go:74] Creating a service spark-pi-1588551100-ui-svc for the Spark UI for application spark-pi
I0518 21:00:59.836781       1 submission_runner.go:81] spark-submit arguments: [/opt/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master k8s://https://10.3.0.1:443 --deploy-mode cluster --conf spark.kubernetes.namespace=default --conf spark.app.name=spark-pi --conf spark.kubernetes.container.image=gcr.io/ynli-k8s/spark:v2.3.0 --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=spark-pi --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-id=spark-pi-1588551100 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.driver.cores=0.100000 --conf spark.kubernetes.driver.limit.cores=200m --conf spark.driver.memory=512m --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.driver.label.version=2.3.0 --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/volumemounts.test-volume=Cgt0ZXN0LXZvbHVtZRAAGgQvdG1wIgA= --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/volumes.test-volume=Cgt0ZXN0LXZvbHVtZRITChEKBC90bXASCURpcmVjdG9yeQ== --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=spark-pi --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-id=spark-pi-1588551100 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.executor.instances=1 --conf spark.executor.cores=1 --conf spark.executor.memory=512m --conf spark.kubernetes.executor.label.version=2.3.0 --conf spark.kubernetes.executor.annotation.sparkoperator.k8s.io/volumemounts.test-volume=Cgt0ZXN0LXZvbHVtZRAAGgQvdG1wIgA= --conf spark.kubernetes.executor.annotation.sparkoperator.k8s.io/volumes.test-volume=Cgt0ZXN0LXZvbHVtZRITChEKBC90bXASCURpcmVjdG9yeQ== --conf spark.kubernetes.driver.annotation.sparkoperator.k8s.io/ownerreference=ChBTcGFya0FwcGxpY2F0aW9uGghzcGFyay1waSIkNjZjMzE3ZmQtNWFkZS0xMWU4LWE3ZDgtMDI0Y2Y1NjlmMTZhKh1zcGFya29wZXJhdG9yLms4cy5pby92MWFscGhhMQ== local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar]
I0518 21:01:02.913068       1 submission_runner.go:99] spark-submit completed for SparkApplication spark-pi in namespace default
I0518 21:01:03.915572       1 submission_runner.go:99] spark-submit completed for SparkApplication spark-pi in namespace default

And now, two drivers and their executors are running:

$ kubectl get pod
NAME                                               READY     STATUS    RESTARTS   AGE
spark-pi-55f4dc6f838933ea9cde3cb54826a448-driver   1/1       Running   0          48s
spark-pi-55f4dc6f838933ea9cde3cb54826a448-exec-1   1/1       Running   0          2s
spark-pi-98bfa61b65903a70859b66af3f7529f8-driver   1/1       Running   0          49s
spark-pi-98bfa61b65903a70859b66af3f7529f8-exec-1   1/1       Running   0          2s

Any ideas for ensuring that only one runs?

Side question: if the driver or any one of the executors were to fail/stop/get rescheduled, does the controller notice and fully delete/redeploy the sparkapplication?

Thanks!

Support for imagePullSecrets

Hello,

We are using our private Docker repository for our images, and as a result when specifying a image for container for pod we have to specify imagePullSecrets.

Right now as I can see it is not supported. I can see that normal Secrets are supported, but not imagePullSecrets which are used to authenticate with Docker repository. Please correct me if I'm wrong.

Do you plan to add this support? I can and create a PR for that if that helps.

Thank you.

Allow uploading dependencies to s3

Currently, there is no implementation for s3 - https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/sparkctl/cmd/create.go#L273

One solution, reusing existing flags and envs (--project and GOOGLE_APPLICATION_CREDENTIALS) could be to have --project as s3 profile and some other ENV to point to configuration file, structured as:

[<project>]
s3_access_key_id=<key>
s3_secret_access_key=<secret>
region=<region>
output=json

Could that be a correct approach?

spark-operator-rbac should only grant required roles

spark-operator-rbac.yaml currently grants ALL roles to the spark-operator service account.

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: sparkoperator
rules:
- apiGroups:
  - "*"
  resources:
  - "*"
  verbs:
  - "*"

it should only be granted the roles it actually needs.

Sync with apache/spark/master

Notable changes in apache/spark/master post 2.3.0:

New config key for specifying physical executor cpu requests: apache/spark#20460 (covered in #158).
Removal of the init-container: apache/spark#20669. Note that this PR also introduced a ConfigMap carrying Spark configuration properties in a file for the driver. The environment variable SPARK_CONF_DIR is set to point to the mount path /opt/spark/conf of the ConfigMap. This is in conflict with what spec.sparkConfigMap is designed to do. Created #216.
New config key for specifying image pull secrets: apache/spark#20811 (covered in #158).
Refactoring (internal and won't affect the operator): apache/spark#20910.
Memory request change (won't affect the operator): apache/spark#20943.
PySpark support: apache/spark#21092 (the operator should work for PySpark out-of-the-box similarly to what branch spark-2.2-support is) (covered in #181).
New config mechanism for specifying and mounting volumes: apache/spark#21260 (won't use this as it currently only supports hostPath, emptyDir, and PVC).
New config key for specifying secretKeyRef: apache/spark#21317 (covered in #158).

Add support for mounting user-specified volumes

Mounting volumes can be a way of making users' application dependencies available, additionally to HTTP/HTTPs servers, GCS, S3, etc.

Start looking at adding online model serving on the otherside

Batch-only-training is cool, but do you know whats cooler? E2E training to serving!

We could look at the ML pipeline stage as the starting point. But maybe add trait for exportable to.

Changed the import paths to use github.com/GoogleCloudPlatform/spark-on-k8s-operator instead of k8s.io

Spark executors won't start

I have a docker image for a Spark 2.3 job that I could run successfully on Kubernetes using spark-submit. But when I started the job using the operator, the only things that got started were the driver pod and the UI svc, no Spark executors were launched. I didn't see any error in the sparkoperator pod and my app's driver pod. Anyone know what might be causing this issue? Btw, running the example spark-pi job using the operator works fine (executors got started etc).

Specifying driver and executor pod properties

Starting a couple of discussions around specification of properties.

spec:
  deps: {}
  driver:
    cores: "0.1"
    image: liyinan926/spark-driver:v2.3.0
  executor:
    image: liyinan926/spark-executor:v2.3.0
    instances: 1
    memory: 512m

Should this mimic the driver/executor properties more closely?
For example: spark.kubernetes.driver.limit.cores.

/xref: https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-docs/_site/running-on-kubernetes.html

Multiple spark versions for breaking changes in spark submit

Suppose a user's cluster has a Spark operator bundled with version 2.3 of Spark Submit. Suppose then we introduce a breaking change in the contract between Spark Submit and the Spark application docker containers. Therefore any Spark application using Spark 2.4 needs to use Spark 2.4's spark submit, and conversely, all applications using Spark 2.3 need to use Spark 2.3's spark submit.

Unfortunately it's not easy right now to support having a single Spark operator deploying both Spark 2.3 and Spark 2.4 applications. If you upgrade the Spark operator's bundled Spark submit version to 2.4, suddenly all applications running Spark 2.3 will break.

There are multiple ways to tackle this problem. First of all, we should be ensuring that the contract between spark submit and the Spark driver it launches is stable. This has been the case for a long time with Spark on YARN.

However, we can be clever and allow for the above use case even if the above contract breaks, as follows:

Add a sparkVersion field to the spec of SpatkApplication
The Spark operator comes bundled with a certain Spark submit version that is compatible with some subset of spark versions of Spark applications
A given operator will only deploy a SparkApplication if the sparkVersion declared by that application is compatible with its bundled spark-submit version.

This allows one to implement something like a rolling upgrade strategy for Spark applications run on their cluster. As an example, a cluster administrator can install a Spark 2.4 compatible Spark operator and leave the Spark 2.3 operator running on the cluster. Once the organization has upgraded all Spark applications to Spark 2.4 and higher, the Spark 2.3 operator can be decommissioned.

Tasks:

Add a new field for specifying the Spark version.
Use a Kubernetes Job to run spark-submit.

Make the operator work for PySpark in spark master

The operator currently does not support PySpark, which is available now in the master branch of Spark. The following changes are needed to make the operator support PySpark in the master branch:

Make the operator use the new Spark configuration properties introduced in apache/spark#21092.
Create a version of the operator image based on a Spark image of the master branch.

support for operating on a specific namespace

the current spark operator setup assumes that it runs in its own namespace and manages the sparkapp crds for the whole cluster. we were hoping that we could deploy this alongside our application which currently has the requirement that it has namespace isolation. with our current setup, if we have multiple applications on our cluster, each spark-operator sees all the crds on the cluster regardless of namespace. i was hoping i could restrict this with different role bindings but alas...

E0529 21:44:38.579020       1 reflector.go:205] k8s.io/spark-on-k8s-operator/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.SparkApplication: sparkapplications.sparkoperator.k8s.io is forbidden: User "system:serviceaccount:sreisor:ideal-tuatara-sparkoperator" cannot list sparkapplications.sparkoperator.k8s.io at the cluster scope
E0529 21:44:38.579370       1 reflector.go:205] k8s.io/spark-on-k8s-operator/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.ScheduledSparkApplication: scheduledsparkapplications.sparkoperator.k8s.io is forbidden: User "system:serviceaccount:sreisor:ideal-tuatara-sparkoperator" cannot list scheduledsparkapplications.sparkoperator.k8s.io at the cluster scope

is there anything we can do to restrict spark operator listing only within its namespace or a specific namespace? it's our first dependency that would break our ability to helm install our application into an isolated namespace.

Finalize SparkApplication API for alpha release

CronJob(ScheduledSparkApplication) does not launch any pods

I am able to run python spark applications without any issues, but when I add the schedule and change from SparkApplication to ScheduledSparkApplication (with other required changes) it fails to start any pods. It creates the service which fails with following and nothing happens.

Error creating load balancer (will retry): error getting LB for service default/spark-my-wc-py-1525717818370411106-365697849-ui-svc: Service(default/spark-my-wc-py-1525717818370411106-365697849-ui-svc) - Loadbalancer not found

Env:
I am running the 2.2 fork which has the python support.

kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-09T21:51:54Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T11:55:20Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

Here is the deployment YAML for scheduledsparkapplication .

apiVersion: "sparkoperator.k8s.io/v1alpha1"
kind: ScheduledSparkApplication
metadata:
  name: pyspark-example
  namespace: default
spec:
  schedule: "*/10 * * * *"
  concurrencyPolicy: Replace
  runHistoryLimit: 3
  template:
    type: Python
    mode: cluster
    mainApplicationFile: "/code/pyspark-example.py"
    Arguments:
    - "wasb://[email protected]/episodes.avro"
    deps:
      jars:
        - "local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar" 
    sparkConf:
      "spark.hadoop.fs.azure.account.key.sparkdkakadia.blob.core.windows.net": "4gIguFMiymTI4BoHL2LsbIdEVjDYUZfcgj+/eUXWH9RzJPQAlQpUsT1+hAAsVJdlZd9fBQ+ctZV+I55A=="
    driver:
      cores: 0.1
      image: dharmeshkakadia/spark-py-example:latest
      coreLimit: "200m"
      memory: "1024m"
      labels:
        version: 2.2.0
      serviceAccount: spark
    executor:
      cores: 1
      instances: 2
      image: dharmeshkakadia/spark-executor-py:latest
      memory: "1024m"
      labels:
        version: 2.2.0
    restartPolicy: Never

Python support

Hi,

Thanks for the operator.

I was trying to play around and looking at the documentation (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md) I see Python as a supported type in the SparkApplication Spec but i am running into "Python applications are currently not supported for Kubernetes." .

I am assuming its stemming from the base spark image: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L332 . Thoughts ?

I can share more details if you believe this can be an issue on my end.

resources are not cleaned up when a sparkapp is deleted

right now the onDelete handler in the sparkapp controller doesn't really do anything. this is the last opportunity for it to clean up the pods / services created. seems like deleteDriverAndUIService() should be called in onDelete()

Update the CustomResourceDefinitions if `install-crds=true` and they already exist

A better strategy for life cycle management of the CustomResouceDefinition resource for SparkApplication is needed. Currently, the controller creates the CustomResouceDefinition resource upon starting and deletes it upon terminating. During a controller restart, existing SparkApplication objects will be gone and users won't be able to create new SparkApplication objects before the controller starts up again.

Quick start guide and detailed user guide

Add documentation on exporting SparkApplication events to Stackdriver

SparkApplication events can be exported to Stackdriver logging using https://github.com/GoogleCloudPlatform/k8s-stackdriver/tree/master/event-exporter. Event exporting to Stackdriver logging is enabled by default on GKE. We need better documentation.