
Demo: Running Spark Examples on minikube

This demo shows how to run the Spark example applications on minikube, advancing from spark-shell on minikube to a more serious deployment (yet still with no Spark development required).

This demo lets you explore deploying a Spark application (e.g. SparkPi) to Kubernetes in cluster deploy mode.

Before you begin

Start up minikube with the necessary Kubernetes resources.

Follow the steps in Demo: spark-shell on minikube and Demo: Running Spark Application on minikube:

  1. Start minikube
  2. Build Spark Image
  3. Create Kubernetes Resources

Running SparkPi on minikube

Change to your Spark installation directory and note the URL of the Kubernetes API server (k is an alias of kubectl):

cd $SPARK_HOME
K8S_SERVER=$(k config view --output=jsonpath='{.clusters[].cluster.server}')

Let's use an environment variable for the name of the driver pod to make it "stable" and predictable. It should make viewing logs and restarting Spark examples easier: just change the environment variable or delete the pod and off you go!

export POD_NAME=a1

WIP: No use of run-example without spark.kubernetes.file.upload.path

Using the run-example shell script to run the Spark examples will not work unless you define the spark.kubernetes.file.upload.path configuration property. run-example uses the spark-examples jar file that is only available locally, so spark-submit (used under the covers) has to upload this locally-available resource file to a location that is accessible from within the cluster.

We'll get to it later. Consider it a work in progress. The following run-example invocation is expected to fail for this reason:

./bin/run-example \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --name $POD_NAME \
  --jars local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar \
  --conf spark.kubernetes.container.image=spark:v3.2.1 \
  --conf spark.kubernetes.driver.pod.name=$POD_NAME \
  --conf spark.kubernetes.namespace=spark-demo \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --verbose \
  SparkPi 10
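
With spark.kubernetes.file.upload.path defined, run-example should be able to upload the local examples jar to a location reachable from the cluster. The following is only a sketch (not exercised in this demo): s3a://my-bucket/spark-uploads is a placeholder and assumes a working Hadoop S3A setup.

./bin/run-example \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --name $POD_NAME \
  --conf spark.kubernetes.file.upload.path=s3a://my-bucket/spark-uploads \
  --conf spark.kubernetes.container.image=spark:v3.2.1 \
  --conf spark.kubernetes.driver.pod.name=$POD_NAME \
  --conf spark.kubernetes.namespace=spark-demo \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  SparkPi 10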

For now, use spark-submit directly and reference the examples jar that is already available in the container image:

./bin/spark-submit \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --name $POD_NAME \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=spark:v3.2.1 \
  --conf spark.kubernetes.driver.pod.name=$POD_NAME \
  --conf spark.kubernetes.namespace=spark-demo \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --verbose \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar 10

In another terminal, use k get po -w to watch the driver and executor pods.

k get po -w

Review the logs of the driver (as long as the driver pod is up and running).

k logs $POD_NAME

For repeatable SparkPi executions, delete the driver pod using the k delete po command.

k delete po $POD_NAME

spark.kubernetes.executor.deleteOnTermination Configuration Property

Use the spark.kubernetes.executor.deleteOnTermination configuration property (default: true) to control whether executor pods are deleted once a Spark application is finished. Set it to false to keep them available (e.g. for examination).
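
For example, the following variant of the SparkPi submission (same as above, with just the extra property) should leave the executor pods around in a terminal state after the application finishes:

./bin/spark-submit \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --name $POD_NAME \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.executor.deleteOnTermination=false \
  --conf spark.kubernetes.container.image=spark:v3.2.1 \
  --conf spark.kubernetes.driver.pod.name=$POD_NAME \
  --conf spark.kubernetes.namespace=spark-demo \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar 10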

Using Pod Templates

The spark.kubernetes.driver.podTemplateFile configuration property lets you define a template file for driver pods (e.g. for multi-container pods).

spark.kubernetes.executor.podTemplateFile Configuration Property

Use the spark.kubernetes.executor.podTemplateFile configuration property to define the template file of executor pods.
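
For example, add the following to a spark-submit invocation (executor-pod-template.yml is a hypothetical file name):

--conf spark.kubernetes.executor.podTemplateFile=executor-pod-template.yml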

Pod Template

The following is a very basic pod template file. It is not correct Spark-wise (it does not really allow submitting Spark applications), but it is good enough to play with pod templates.

spec:
  containers:
  - name: spark
    image: busybox
    command: ['sh', '-c', 'echo "Hello, Spark on Kubernetes!"']

Save the template above as pod-template.yml and use it to submit SparkPi:

./bin/spark-submit \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --conf spark.kubernetes.driver.podTemplateFile=pod-template.yml \
  --name $POD_NAME \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=spark:v3.2.1 \
  --conf spark.kubernetes.driver.pod.name=$POD_NAME \
  --conf spark.kubernetes.namespace=spark-demo \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --verbose \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar 10

Review the logs of the driver pod:

k logs $POD_NAME
Hello, Spark on Kubernetes!
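
Delete the driver pod before the next run:

k delete po $POD_NAME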

Container Resources

In cluster deploy mode, Spark on Kubernetes adds extra non-heap memory overhead to the memory requirements of the driver pod (based on the spark.kubernetes.memoryOverheadFactor configuration property).

Check the resource requirements of the driver pod:

k get po $POD_NAME -o=jsonpath='{.spec.containers[0].resources}' | jq
{
  "limits": {
    "memory": "1408Mi"
  },
  "requests": {
    "cpu": "1",
    "memory": "1408Mi"
  }
}

Note that with the default driver memory of 1g and the default spark.kubernetes.memoryOverheadFactor of 0.1, the driver pod requests 1024Mi + max(0.1 × 1024Mi, 384Mi) = 1408Mi. These extra memory requirements could also be part of a pod template. Increase the overhead factor and observe the change:

./bin/spark-submit \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --driver-memory 1g \
  --conf spark.kubernetes.memoryOverheadFactor=0.5 \
  --name $POD_NAME \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=spark:v3.2.1 \
  --conf spark.kubernetes.driver.pod.name=$POD_NAME \
  --conf spark.kubernetes.namespace=spark-demo \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --verbose \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar 10

With spark.kubernetes.memoryOverheadFactor raised to 0.5, the driver pod now requests 1024Mi + max(0.5 × 1024Mi, 384Mi) = 1536Mi:

k get po $POD_NAME -o=jsonpath='{.spec.containers[0].resources}' | jq
{
  "limits": {
    "memory": "1536Mi"
  },
  "requests": {
    "cpu": "1",
    "memory": "1536Mi"
  }
}