Swapping Disks in Kubernetes for Fun and Profit
Introducing the PvcAutoscaler at City Storage Systems
Written by Jakob Schultz-Falk, a member of the storage team at City Storage Systems who led the development of the PvcAutoscaler.
Since the introduction of the StatefulSet, more and more stateful workloads have been added to the Kubernetes ecosystem. Unfortunately, there are still plenty of caveats to running stateful workloads in Kubernetes, some of which are caused by fundamental limitations of the StatefulSet controller and the PVCs it generates.
While stateless workloads enjoy almost boundless elasticity in Kubernetes, stateful workloads, once deployed, are bound by the volumes they mount. These volumes are static and practically immutable, making it difficult to scale them to match the ever-changing needs of the workloads using them.
This post presents the solution we’ve developed at City Storage Systems to reclaim stateful elasticity, improve cost efficiency, and reduce toil.
The Problem
While there is some support in Kubernetes for expanding storage volumes, there is no such mechanism for reducing storage capacity, nor for changing the underlying storage type. In fact, the original Kubernetes Enhancement Proposal (KEP) for supporting volume expansion explicitly lists volume reduction as a non-goal due to the complexities of shrinking volumes.
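For reference, expansion (where the provisioner supports it) is requested simply by raising the storage request on an existing PVC, provided its StorageClass has allowVolumeExpansion enabled. A minimal sketch, with illustrative names:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: www-data       # illustrative PVC name
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi     # raised from 4Gi; the provisioner grows the PV in place
  storageClassName: hdd-sc
  volumeMode: Filesystem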
Furthermore, the volumeClaimTemplates section of the StatefulSet spec is currently immutable, preventing operators from expanding volumes via a high-level API, though a future enhancement has been proposed.
This leads to a situation where the total provisioned capacity only grows, and any storage-related change requires a high-effort, risky migration, inflating the cost of running stateful workloads.
To summarize, the two primary culprits restraining stateful elasticity are:
Only volume expansion is supported, and only by a subset of volume provisioners
Immutable volume claim templates embedded in the StatefulSet spec
Our Objectives
Now that we have a clear understanding of the problem, let’s list the objectives we want to achieve with our new solution.
Volume expansion for growing storage needs
Volume shrinking to reclaim cost overhead from unused storage
Volume modification, e.g. being able to swap an HDD for an SSD or vice versa
All of the objectives above should be met by an on-demand, declarative, and toil-free solution usable by any software engineer accustomed to Kubernetes and StatefulSets.
Kubernetes Storage Concepts
Before diving into the inner workings of the PvcAutoscaler, let’s briefly introduce some of the core Kubernetes components we will rely on to achieve our objectives. These concepts are described in more detail in the official Kubernetes storage documentation.
Persistent Volumes and Persistent Volume Claims
Persistent Volumes (PVs) are Kubernetes resources that describe the underlying storage they represent. That storage is typically (but not necessarily) a cloud provider managed disk.
Persistent Volume Claims (PVCs) are resources which bind to PVs, thereby reserving use of the PV across the cluster. Pods in the same namespace as the PVC can then reference the PVC to use the underlying storage.
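Concretely, a pod consumes a PVC by declaring it as a volume and mounting it into a container. A minimal sketch, with illustrative names:
apiVersion: v1
kind: Pod
metadata:
  name: web
  namespace: default
spec:
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /usr/share/nginx/html
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: www-data   # the PVC, which in turn is bound to a PV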
The StorageClass
In order to provision different types of storage, Kubernetes offers the concept of a StorageClass. These can be used to describe different types of storage, e.g. one StorageClass could be configured to provision PVs backed by SSDs while another provides access to HDDs.
In general, a new PV/PVC pair is provisioned for a pod by creating a PVC that references a StorageClass. Since the PVC is not bound to any PV at creation, the volume provisioner of the StorageClass will attempt to provision a PV and bind it to the PVC.
The StorageClass contains a reference to a volume provisioner along with the parameters needed to create a new PV backed by some underlying storage. For example, a volume provisioner developed by a cloud provider would take the input parameters and provision a network-attached disk with the desired configuration.
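An hdd-sc StorageClass like the one referenced later in this post might look as follows; the provisioner and parameters shown are illustrative and vary by cloud provider:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hdd-sc
provisioner: pd.csi.storage.gke.io   # example CSI provisioner
parameters:
  type: pd-standard                  # example parameter selecting HDD-backed disks
allowVolumeExpansion: true           # opts PVs of this class into expansion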
The StatefulSet
This resource type was added to Kubernetes to provide stable naming mapped to persistent storage across pod restarts. Whereas a Deployment creates pods with unique, randomized names on each creation, a StatefulSet creates pods with serial ordinals, each mapping one-to-one to a PVC of the same ordinal.
The PVCs can either be provisioned manually upfront, or their creation can be left to the StatefulSet controller, based on the volume claim templates defined in its spec. As mentioned above, the StatefulSet offers no support for changes to the PVCs or the underlying storage once provisioned.
The PvcAutoscaler
Now that we’ve introduced the core storage concepts we will be referencing, we can begin using them to build up our PvcAutoscaler solution. We start out by addressing the shortcomings of the StatefulSet.
Taking Control
Since volume claim templates in StatefulSets are immutable, our first task is to detach them so we can modify them as needed. We accomplish this by creating a new Custom Resource Definition (CRD) called PvcAutoscaler. One PvcAutoscaler resource can contain a single volume claim template and a reference to the StatefulSet whose PVCs it manages.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx
  namespace: default
spec:
  serviceName: "nginx"
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.k8s.io/nginx:latest
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: www
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 4Gi
      storageClassName: hdd-sc
      volumeMode: Filesystem
Figure 1: The original StatefulSet with its embedded volume claim template
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx
  namespace: default
spec:
  serviceName: "nginx"
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.k8s.io/nginx:latest
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
---
apiVersion: storage.css.com/v1
kind: PvcAutoscaler
metadata:
  name: nginx
  namespace: default
spec:
  statefulSetName: nginx
  volumeClaimTemplate:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: www
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 4Gi
      storageClassName: hdd-sc
      volumeMode: Filesystem
Figure 2: Moving the volume claim templates into PvcAutoscaler resources
Now that the volume claim templates are isolated in PvcAutoscaler resources, we can create a Kubernetes operator for the CRD which generates PVCs for each pod in the referenced StatefulSet, based on the embedded volume claim template.
Figure 3: PvcAutoscaler becoming responsible for the creation of PVCs for a StatefulSet
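For the one-replica StatefulSet above, the operator would generate a single PVC per volume claim template, along the lines of the following sketch. The name matches the ordinal-plus-hash scheme seen later in Figure 5; the label is a hypothetical ownership marker, not the actual API:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-0-8770e        # ordinal 0 plus a hash suffix
  namespace: default
  labels:
    storage.css.com/pvc-autoscaler: nginx   # hypothetical label tying the PVC to its PvcAutoscaler
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
  storageClassName: hdd-sc
  volumeMode: Filesystem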
We now have PVCs for the pods of a StatefulSet, based on a volume claim template embedded in a PvcAutoscaler custom resource. Unfortunately, the StatefulSet controller is completely unaware of these PVCs and therefore cannot include them in the pod specs to make them mountable volumes. This brings us to our next task.
Attaching Volumes
Luckily, Kubernetes gives us a tool to ensure the PVCs are available as volumes in the pod spec. By adding a mutating webhook targeting the pods of StatefulSets referenced by a PvcAutoscaler, we can look up the PVCs and attach them to a pod's volumes spec prior to validation and creation.
Figure 4: Mutating pod webhook attaching PvcAutoscaler PVCs to StatefulSet pods
After attaching the PVCs, each pod now has a valid spec with volumeMounts all referencing correctly declared volumes.
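A minimal sketch of how such a webhook could be registered; the service name, namespace, and path are assumptions for illustration, and the caBundle required for TLS is omitted:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: pvc-autoscaler-pod-webhook
webhooks:
- name: pods.storage.css.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail          # pods must not start without their volumes
  clientConfig:
    service:
      name: pvc-autoscaler     # hypothetical webhook service
      namespace: pvc-autoscaler-system
      path: /mutate-pods
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]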
Expansion Unlocked
After all that work we have returned to a functional setup similar to the plain StatefulSet. We can define a StatefulSet and a number of PvcAutoscaler resources which will manage PVCs and attach them to the pods at creation.
However, we have unlocked one great benefit from this exercise: complete control over the volume claim template embedded in the PvcAutoscaler. We can now decide which limitations to impose, i.e. we can make properties mutable.
The first target for mutability is the .spec.resources.requests.storage property, since our first objective is to allow seamless volume expansion. Allowing volume expansion when the underlying volume provisioner natively supports it is trivial: we simply propagate the change in requests to all PVCs and wait for the provisioner to finish the expansion.
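For example, growing every volume of the nginx StatefulSet from Figure 2 becomes a one-line change to the template:
apiVersion: storage.css.com/v1
kind: PvcAutoscaler
metadata:
  name: nginx
  namespace: default
spec:
  statefulSetName: nginx
  volumeClaimTemplate:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: www
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 8Gi   # raised from 4Gi; propagated to every PVC
      storageClassName: hdd-sc
      volumeMode: Filesystem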
In the cases where volume expansion is not natively supported, we would need to follow a more cumbersome process detailed in the next sections.
The Volume Populator
While volume expansion is supported by some volume provisioners, neither shrinking a volume nor changing the underlying storage device is supported at all. To enable these features we introduce a new custom component, the volume-populator.
PVCs have for some time supported the dataSource property, originally intended to bootstrap a PVC from a volume snapshot. However, by enabling the AnyVolumeDataSource feature gate (enabled by default since Kubernetes 1.24), it becomes possible to use the dataSourceRef property to reference any custom resource (CR).
To leverage this, we create a new CRD called the PvcSourcePopulator. It is owned by the volume-populator, and its only purpose is to reference the old PVC; the PvcSourcePopulator can in turn be referenced by a new PVC through the dataSourceRef property.
apiVersion: storage.css.com/v1
kind: PvcSourcePopulator
metadata:
  name: populator-pvc-0-ba51f
  namespace: default
spec:
  sourcePvcRef:
    name: pvc-0-8770e
    namespace: default
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-0-ba51f
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  dataSourceRef:
    apiGroup: storage.css.com
    kind: PvcSourcePopulator
    name: populator-pvc-0-ba51f
  resources:
    requests:
      storage: 2Gi
  storageClassName: ssd-sc
  volumeMode: Filesystem
Figure 5: A PvcSourcePopulator CR being referenced by a new PVC with a different storage class and smaller requested capacity
The volume-populator is responsible for transferring the content of the old PVC referenced in the PvcSourcePopulator to the new PVC. The process it follows can be summarized as follows:
PVCnew is created with a PvcSourcePopulator data source referencing PVCold. PVCnew enters a Pending state since the volume provisioner does not recognize the data source type; this allows us to take over the task of eventually binding a PV to PVCnew
The volume-populator monitors pods in the namespace, waiting for PVCnew to be referenced, to ensure the data transfer is executed immediately prior to pod initialization
Once PVCnew is referenced, the volume-populator creates an empty PVCtmp with a spec identical to PVCnew, but without a data source. Without the data source, the volume provisioner is able to create a PV and bind it to PVCtmp
The volume-populator spins up a pod which mounts both PVCold and PVCtmp, and transfers the content from PVCold to PVCtmp
Once the transfer completes, the volume-populator unbinds the PV from PVCtmp and instead binds it to PVCnew (see the sketch after Figure 6)
The volume-populator discards PVCtmp, which no longer has a PV bound to it
PVCnew becomes ready as it now has a PV bound, and the pod can initialize normally
Figure 6: The volume-populator bootstrapping a new PVC for pod-0
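The rebinding step operates at the PV level. Assuming the standard Kubernetes binding mechanism, repointing the PV's claimRef at PVCnew (together with setting PVCnew's volumeName) is what moves the volume. A rough sketch of the relevant PV fields, with illustrative names and UID:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-disk-0            # illustrative name for the PV provisioned for PVCtmp
spec:
  capacity:
    storage: 2Gi
  accessModes:
  - ReadWriteOnce
  storageClassName: ssd-sc
  claimRef:                  # rewritten by the volume-populator
    apiVersion: v1
    kind: PersistentVolumeClaim
    namespace: default
    name: pvc-0-ba51f        # previously pointed at PVCtmp, now at PVCnew
    uid: 0fd0ad43-0000-0000-0000-000000000000   # illustrative UID; must match PVCnew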
Just Change
By using the volume-populator in the PvcAutoscaler operator, we can efficiently and safely swap out PVCs to match the desired state of the volume claim template. This allows us to support changes to almost all aspects of the volume claim template, including the storage class, which changes the underlying storage device.
When the PvcAutoscaler operator detects drift between the volume claim template and the current PVCs of a StatefulSet, it first determines whether the change can be applied via online expansion. If it cannot, it initiates the process of swapping out the PVCs using the volume-populator. This process can be summarized as follows:
Create new PVCs for each pod, each containing a PvcSourcePopulator data source referencing the old PVC
Initiate a rolling restart of the StatefulSet
Upon pod creation, the mutating webhook injects the new PVC
The volume-populator detects a pod being created with a reference to a PVC carrying a PvcSourcePopulator data source
The volume-populator transfers the data; once it completes, the pod starts up normally
This process is repeated for each pod in the StatefulSet
Figure 7: Swapping out the PVCs by using the volume-populator to transfer all data from the previous PVC
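Tying it together, the change that produced the new PVC in Figure 5 (smaller capacity, SSD-backed storage class) would be declared by updating the PvcAutoscaler template from Figure 2:
apiVersion: storage.css.com/v1
kind: PvcAutoscaler
metadata:
  name: nginx
  namespace: default
spec:
  statefulSetName: nginx
  volumeClaimTemplate:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: www
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 2Gi           # shrink from 4Gi
      storageClassName: ssd-sc   # swap the HDD class for SSD
      volumeMode: Filesystem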
Once all the pods have been restarted and had their PVCs bootstrapped, the StatefulSet re-enters a steady state. After a pre-configured retention period, the PvcAutoscaler automatically deletes the old PVCs to reclaim the associated cost.
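One plausible place to surface that retention period is on the PvcAutoscaler spec itself; the field below is purely our illustrative assumption, not the actual API:
apiVersion: storage.css.com/v1
kind: PvcAutoscaler
metadata:
  name: nginx
  namespace: default
spec:
  statefulSetName: nginx
  oldPvcRetentionPeriod: 168h   # hypothetical field: keep old PVCs for 7 days before deletion
  # volumeClaimTemplate omitted for brevity; see Figure 2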
The capabilities of the PvcAutoscaler allow us to ignore future requirements for a stateful workload's storage characteristics and focus on the current need. We can start with a small HDD, confident that we can boost it to an SSD if the need arises. We can add extra capacity, knowing that we can scale in again later to save on cost. All of it on-demand, declarative, and toil-free.
Next Steps
The current solution is robust and battle-tested, and it has drastically improved the cost efficiency of our stateful workloads, but there is always room for improvement. In the future we’re likely to explore the following areas.
Adding actual auto-scaling to the PvcAutoscaler, allowing automated execution of PVC modifications based on declared sizing strategies and disk metrics
Support for stateful workloads that do not use StatefulSets but instead run under a custom pod controller
Hot copying of data to reduce the time it takes to scale down disks and switch storage types
Conclusion
The key takeaway we’ve taken to heart from developing the PvcAutoscaler is that Kubernetes is immensely extensible. With enough insight into the ecosystem, it is possible to extend even core components with new features and remove otherwise hard limitations. This ability to extend Kubernetes is especially relevant for stateful workloads, which tend to have more specific requirements and limitations than stateless workloads.
Overall, the PvcAutoscaler has provided great value since it was rolled out. While the original focus was on reducing cost by eliminating wasteful over-provisioning of storage capacity, the PvcAutoscaler has also enabled us to easily move workloads onto more performant disks when usage reached IOPS and throughput limits.
Without the PvcAutoscaler, we would have been forced to spend far more time and resources right-sizing the storage of our stateful workloads as consumer workloads' usage patterns change over time. In many instances the effort required to scale in storage would not have been worth the cost reduction, leading to ever-increasing overhead. But the PvcAutoscaler makes even small cuts in storage worthwhile.