Skip to content

Demo: Spark and Local Filesystem in minikube

The demo shows how to set up a Spark application on minikube to access files on a local filesystem.

Danger

The demo stopped working for some reasons I cannot explain and sort out. The error message is the following:

Warning  Failed     13s   kubelet            Error: failed to start container "spark-kubernetes-driver": Error response from daemon: error while creating mount source path '/tmp/spark-k8s-demo/mount': mkdir /tmp/spark-k8s-demo: file exists

The demo uses spark-submit --files and spark.kubernetes.file.upload.path configuration property to upload a static file to a directory that is then mounted to Spark application pods.

Volumes in Kubernetes are directories which are accessible to the containers in a pod. In order to use a volume, you should specify the volumes to provide for the Pod in .spec.volumes and declare where to mount those volumes into containers in .spec.containers[*].volumeMounts.

Before you begin

It is assumed that you have finished the following:

Start Cluster

minikube start

Environment Variables

export K8S_SERVER=$(k config view --output=jsonpath='{.clusters[].cluster.server}')
export POD_NAME=meetup-spark-app
export IMAGE_NAME=$POD_NAME:0.1.0

export SOURCE_DIR=/tmp/spark-k8s-demo
export VOLUME_TYPE=hostPath
export VOLUME_NAME=demo-host-mount
export MOUNT_PATH=$SOURCE_DIR/mount

Mounting Filesystems

Let's kick things off by mounting a host directory (/tmp/spark-k8s) to minikube.

Quoting Mounting filesystems of minikube's official documentation:

To mount a directory from the host into the guest use the mount subcommand

minikube mount $SOURCE_DIR:$MOUNT_PATH
πŸ“  Mounting host path /tmp/spark-k8s-demo into VM as /tmp/spark-k8s-demo ...
    β–ͺ Mount type:
    β–ͺ User ID:      docker
    β–ͺ Group ID:     docker
    β–ͺ Version:      9p2000.L
    β–ͺ Message Size: 262144
    β–ͺ Permissions:  755 (-rwxr-xr-x)
    β–ͺ Options:      map[]
    β–ͺ Bind Address: 127.0.0.1:53819
πŸš€  Userspace file server: ufs starting
βœ…  Successfully mounted /tmp/spark-k8s-demo to /tmp/spark-k8s-demo

πŸ“Œ  NOTE: This process must stay alive for the mount to be accessible ...

Validating Mount

Use minikube ssh to validate the volume on the minikube's VM.

minikube ssh
docker@minikube:~$ ls -ld /tmp/spark-k8s-demo
drwxr-xr-x 1 docker docker 96 Mar  9 14:11 /tmp/spark-k8s-demo

Using Kubernetes Volumes

Quoting Using Kubernetes Volumes of Apache Spark's official documentation:

users can mount the following types of Kubernetes volumes into the driver and executor pods:

  • hostPath: mounts a file or directory from the host node’s filesystem into a pod.
  • emptyDir: an initially empty volume created when a pod is assigned to a node.
  • persistentVolumeClaim: used to mount a PersistentVolume into a pod.

hostPath

Let's use Kubernetes' hostPath that requires spark.kubernetes.*.volumes-prefixed configuration properties for the driver and executor pods:

--conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH
--conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH

The demo uses configuration properties to set up a hostPath volume type ($VOLUME_TYPE) with $VOLUME_NAME name and $MOUNT_PATH path on the host (for the driver and executors separately).

cd $SPARK_HOME

The idea is to let spark-submit upload the --files to spark.kubernetes.file.upload.path directory that is available under the same directory (logically by name as on the host). That's why the names of the source and target mounted directories are the same.

./bin/spark-submit \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --files test.me \
  --conf spark.kubernetes.file.upload.path=$SOURCE_DIR \
  --conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
  --conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
  --conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
  --conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
  --name $POD_NAME \
  --class meetup.SparkApp \
  --conf spark.kubernetes.container.image=$IMAGE_NAME \
  --conf spark.kubernetes.driver.pod.name=$POD_NAME \
  --conf spark.kubernetes.context=minikube \
  --conf spark.kubernetes.namespace=spark-demo \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --verbose \
  local:///opt/spark/jars/meetup.meetup-spark-app-0.1.0.jar

Reviewing Volumes

Describe Pod

k describe po $POD_NAME

Pod Volumes

k get po $POD_NAME -o=jsonpath='{.spec.volumes}' | jq
[
  {
    "hostPath": {
      "path": "/tmp/spark-k8s-demo",
      "type": ""
    },
    "name": "demo-host-mount"
  },
  {
    "emptyDir": {},
    "name": "spark-local-dir-1"
  },
  {
    "configMap": {
      "defaultMode": 420,
      "items": [
        {
          "key": "log4j.properties",
          "mode": 420,
          "path": "log4j.properties"
        },
        {
          "key": "spark.properties",
          "mode": 420,
          "path": "spark.properties"
        }
      ],
      "name": "spark-drv-757769781753ba14-conf-map"
    },
    "name": "spark-conf-volume-driver"
  },
  {
    "name": "spark-token-kzmdd",
    "secret": {
      "defaultMode": 420,
      "secretName": "spark-token-kzmdd"
    }
  }
]

Volume Mounts

k get po $POD_NAME -o=jsonpath='{.spec.containers[].volumeMounts}' | jq
[
  {
    "mountPath": "/tmp/spark-k8s-demo",
    "name": "demo-host-mount"
  },
  {
    "mountPath": "/var/data/spark-67727eb6-3ca5-4e72-8ea2-20179aca2831",
    "name": "spark-local-dir-1"
  },
  {
    "mountPath": "/opt/spark/conf",
    "name": "spark-conf-volume-driver"
  },
  {
    "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
    "name": "spark-token-kzmdd",
    "readOnly": true
  }
]

Inside Pod

k exec $POD_NAME -- ls -ltr /tmp/
$ cat /tmp/spark-k8s/spark-upload-5ded6fb9-b5e7-4a09-9d0f-d3c8c85add08/test.me
Hello World

Last update: 2021-03-13