Skip to content

Demo: PersistentVolumeClaims

This demo shows how to use PersistentVolumeClaims.

From Persistent Volumes:

minikube supports PersistentVolumes of type hostPath out of the box. These PersistentVolumes are mapped to a directory inside the running minikube instance

The demo uses OnDemand claim name placeholder to create different PersistentVolume claim names for the executors at deployment.

From Binding:

Once bound, PersistentVolumeClaim binds are exclusive, regardless of how they were bound. A PVC to PV binding is a one-to-one mapping, using a ClaimRef which is a bi-directional binding between the PersistentVolume and the PersistentVolumeClaim.

Claims will remain unbound indefinitely if a matching volume does not exist.

Given the demo uses 2 executors you could simply create 2 persistent volumes and be done. More executors would beg for some automation and Kubernetes supports this use case with Dynamic Volume Provisioning:

Dynamic volume provisioning allows storage volumes to be created on-demand.

In this demo you simply create two PVs to get going.

Before you begin

It is assumed that you have finished the following:

Start Cluster

minikube start

Switch the Kubernetes namespace to ours.

kubens spark-demo

Environment Variables

export K8S_SERVER=$(k config view --output=jsonpath='{.clusters[].cluster.server}')

export POD_NAME=meetup-spark-app
export IMAGE_NAME=$POD_NAME:0.1.0

export VOLUME_NAME=my-pv
export MOUNT_PATH=/mnt/data
export PVC_STORAGE_CLASS=manual
export PVC_SIZE_LIMIT=1Gi

Create PersistentVolumes

The following commands are a copy of Configure a Pod to Use a PersistentVolume for Storage (with some minor changes).

minikube ssh
sudo mkdir /data/pv0001
sudo sh -c "echo 'Hello from Kubernetes storage' > /data/pv0001/hello-message"

Create a PersistentVolume as described in Configure a Pod to Use a PersistentVolume for Storage.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0001
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 2Gi
  hostPath:
    path: /data/pv0001/
  storageClassName: manual
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0002
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 2Gi
  hostPath:
    path: /data/pv0001/
  storageClassName: manual
k apply -f k8s/pvs.yaml
k get pv
NAME     CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
pv0001   2Gi        RWO            Retain           Available           manual                  5s
pv0002   2Gi        RWO            Retain           Available           manual                  5s
k describe pv pv0001
Name:            pv0001
Labels:          <none>
Annotations:     <none>
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    manual
Status:          Available
Claim:
Reclaim Policy:  Retain
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        2Gi
Node Affinity:   <none>
Message:
Source:
    Type:          HostPath (bare host directory volume)
    Path:          /data/pv0001/
    HostPathType:
Events:            <none>

Claim Persistent Volume

You're going to use spark-shell Spark application and executors only are provisioned on Kubernetes.

cd $SPARK_HOME

Spark on Kubernetes sets up Kubernetes' persistentVolumeClaim using spark.kubernetes.executor.volumes-prefixed configuration properties for executor pods.

Spark on Kubernetes uses 2 executors by default (--num-executors 2) and that is why the demo uses OnDemand claim name to generate different PV claim names at deployment.

Watch Persistent Volume Claims

In a separate terminal use the following command to watch persistent volume claims as they are created.

k get pvc -w

The output will be similar to the following:

NAME                                        STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
spark-shell-de6ed2783a61d962-exec-1-pvc-0   Pending                                      manual         0s
spark-shell-de6ed2783a61d962-exec-1-pvc-0   Pending   pv0001   0                         manual         0s
spark-shell-de6ed2783a61d962-exec-1-pvc-0   Bound     pv0001   2Gi        RWO            manual         0s
spark-shell-de6ed2783a61d962-exec-2-pvc-0   Pending                                      manual         0s
spark-shell-de6ed2783a61d962-exec-2-pvc-0   Pending   pv0002   0                         manual         0s
spark-shell-de6ed2783a61d962-exec-2-pvc-0   Bound     pv0002   2Gi        RWO            manual         0s

Start Spark Shell

./bin/spark-shell \
  --master k8s://$K8S_SERVER \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=OnDemand \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.options.storageClass=$PVC_STORAGE_CLASS \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.options.sizeLimit=$PVC_SIZE_LIMIT \
  --conf spark.kubernetes.container.image=$IMAGE_NAME \
  --conf spark.kubernetes.context=minikube \
  --conf spark.kubernetes.namespace=spark-demo \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --verbose

Note

You have to recreate the persistent volumes (Clean Up and Create PersistentVolumes) before running spark-shell again (per Retain reclaim policy).

Review Kubernetes Resources

persistentVolumes

k describe pv $(k get pv -o=jsonpath='{.items[0].metadata.name}')
Name:            pv0001
Labels:          <none>
Annotations:     pv.kubernetes.io/bound-by-controller: yes
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    manual
Status:          Bound
Claim:           spark-demo/spark-shell-de6ed2783a61d962-exec-1-pvc-0
Reclaim Policy:  Retain
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        2Gi
Node Affinity:   <none>
Message:
Source:
    Type:          HostPath (bare host directory volume)
    Path:          /data/pv0001/
    HostPathType:
Events:            <none>

persistentVolumeClaims

k get pvc
NAME                                        STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
spark-shell-de6ed2783a61d962-exec-1-pvc-0   Bound    pv0001   2Gi        RWO            manual         2m19s
spark-shell-de6ed2783a61d962-exec-2-pvc-0   Bound    pv0002   2Gi        RWO            manual         2m19s
k describe pvc $(k get pvc -o=jsonpath='{.items[0].metadata.name}')
Name:          spark-shell-de6ed2783a61d962-exec-1-pvc-0
Namespace:     spark-demo
StorageClass:  manual
Status:        Bound
Volume:        pv0001
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      2Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       spark-shell-de6ed2783a61d962-exec-1
Events:        <none>

Access PersistentVolume

Command Line

k exec -ti $(k get po -o name) -- cat $MOUNT_PATH/hello-message
Hello from Kubernetes storage

Spark Shell

Note

Wish I knew how to show that the only executor has got access to the mounted volume, but nothing comes to my mind. If you happen to know how to demo it, please contact me at jacek@japila.pl. Thank you! ❤️

Clean Up

k delete po --all
k delete pvc --all
k delete pv --all

Clean up the cluster as described in Demo: spark-shell on minikube.

That's it. Congratulations!


Last update: 2021-03-16