feat: report ephemeral storage sizes of node as custom metric #644
patrickpichler wants to merge 1 commit into main
d99c663 to bc1968c
// csi.cast.ai CSI driver (/var/lib/castai-storage), deduplicated by device ID.
// This avoids double-counting PVC-backed mounts that reside under /var/lib/kubelet.
EphemeralStorageSourceStorageOptimization EphemeralStorageSource = "storage-optimization"
// EphemeralStorageSourceAuto probes for the presence of /var/lib/castai-storage
// EphemeralStorageSourceAuto probes for the presence of /var/lib/castai-storage
// at each collection cycle
This comment is most likely incorrect.
EphemeralStorageSourceAuto probes for the presence of /var/lib/castai-storage once during startup and then keeps using the selected EphemeralStorageSource, right?
Good catch! The comment has been updated.
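The behaviour discussed in this thread (probe once at startup, then keep the selected source) could be sketched roughly as follows. This is not the PR's implementation: `resolveSource`, `dirExists`, `castaiStoragePath`, and the `EphemeralStorageSourceKubelet` constant name are assumptions made for illustration. Only the `EphemeralStorageSource` type, the `EphemeralStorageSourceAuto` and `EphemeralStorageSourceStorageOptimization` names, and the `"storage-optimization"` value appear in the diff.

```go
package main

import "os"

// EphemeralStorageSource mirrors the type name seen in the diff.
type EphemeralStorageSource string

const (
	// Assumed constant name; the "kubelet" value comes from the PR description.
	EphemeralStorageSourceKubelet EphemeralStorageSource = "kubelet"
	// Name and value as shown in the diff.
	EphemeralStorageSourceStorageOptimization EphemeralStorageSource = "storage-optimization"
	// Name from the diff comment; the "auto" value comes from the PR description.
	EphemeralStorageSourceAuto EphemeralStorageSource = "auto"
)

// castaiStoragePath is the directory the auto mode probes for.
const castaiStoragePath = "/var/lib/castai-storage"

// dirExists (hypothetical helper) reports whether path is an existing directory.
func dirExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.IsDir()
}

// resolveSource (hypothetical) is meant to be called once at startup.
// Any explicit setting is returned unchanged; "auto" is resolved to a
// concrete source that the collector keeps using afterwards.
func resolveSource(configured EphemeralStorageSource, pathExists func(string) bool) EphemeralStorageSource {
	if configured != EphemeralStorageSourceAuto {
		return configured
	}
	if pathExists(castaiStoragePath) {
		return EphemeralStorageSourceStorageOptimization
	}
	return EphemeralStorageSourceKubelet
}
```

At startup this would be called as `resolveSource(configured, dirExists)`, and the result stored for all later collection cycles. Passing the probe as a function keeps the decision testable without touching the real filesystem.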
Kvisor is now able to determine the amount of ephemeral storage available and used per node. The data is reported as a new field on the `storage_node_metrics` metric. A new `ephemeral-storage-source` argument to kvisor-agent configures how storage metrics are collected. There are currently four modes available:

* `none` - no ephemeral storage metrics will be ingested
* `kubelet` - use the data provided by the kubelet to fill the metric
* `storage-optimization` - if CAST.AI storage-optimization is present, this mode will take it into consideration
* `auto` - automatically detects whether storage-optimization is active and configures the mode accordingly

The logic for `storage-optimization` is far from perfect, as it currently always sums the available disk space of the volume holding `/var/lib/kubelet` and the volume holding `/var/lib/castai-storage`.
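The diff comment describes the `storage-optimization` sum as deduplicated by device ID. A minimal sketch of that idea, assuming the stats for each probed path have already been collected, is shown here. The `diskStats` type and the `sumByDevice` function are hypothetical names for illustration and are not taken from the PR.

```go
package main

// diskStats is a hypothetical record for one probed path.
type diskStats struct {
	Path           string
	DeviceID       uint64 // for example st_dev as returned by stat(2)
	CapacityBytes  uint64
	AvailableBytes uint64
}

// sumByDevice adds up capacity and available bytes, counting each
// device only once even if several probed paths live on it.
func sumByDevice(mounts []diskStats) (capacity, available uint64) {
	seen := make(map[uint64]struct{}, len(mounts))
	for _, m := range mounts {
		if _, dup := seen[m.DeviceID]; dup {
			continue // same underlying disk, already counted
		}
		seen[m.DeviceID] = struct{}{}
		capacity += m.CapacityBytes
		available += m.AvailableBytes
	}
	return capacity, available
}
```

With this shape, a node where `/var/lib/kubelet` and `/var/lib/castai-storage` share one disk reports that disk once, while a node with a dedicated storage disk reports the sum of both.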
bc1968c to b474f1b
// sum of the disk hosting /var/lib/kubelet and the disk backing the
// csi.cast.ai CSI driver (/var/lib/castai-storage), deduplicated by device ID.
// This avoids double-counting PVC-backed mounts that reside under /var/lib/kubelet.
EphemeralStorageSourceStorageOptimization EphemeralStorageSource = "storage-optimization"
I am wondering whether we could simplify the --ephemeral-storage-source flag.
It currently has 4 possible values, which may be hard to maintain in the future.
Instead, would a single auto strategy be enough? The logic could be:
- Always take the base ephemeral storage from the kubelet API (node.fs) — this is the standard Kubernetes-reported value.
- Probe for /var/lib/castai-storage — if it exists and is on a different device than the root filesystem, add its capacity/usage on top.
This way we always report correct ephemeral storage: standard kubelet numbers when there's no CAST AI storage solution, and kubelet + additional disk when there is.
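The proposed single strategy could be sketched like this. It is an illustration of the suggestion above, not code from the PR: `fsStats`, `extraDisk`, and `combineEphemeralStorage` are hypothetical names, and the sketch assumes the kubelet `node.fs` numbers and the probe result for /var/lib/castai-storage have already been obtained.

```go
package main

// fsStats holds capacity and usage in bytes (hypothetical type).
type fsStats struct {
	CapacityBytes uint64
	UsedBytes     uint64
}

// extraDisk is the hypothetical result of probing /var/lib/castai-storage.
type extraDisk struct {
	Present  bool   // whether the directory exists
	DeviceID uint64 // device the directory lives on
	Stats    fsStats
}

// combineEphemeralStorage starts from the kubelet-reported node.fs
// values and adds the extra disk only when it exists and sits on a
// different device than the root filesystem.
func combineEphemeralStorage(kubeletNodeFs fsStats, rootDeviceID uint64, extra extraDisk) fsStats {
	if !extra.Present || extra.DeviceID == rootDeviceID {
		return kubeletNodeFs
	}
	return fsStats{
		CapacityBytes: kubeletNodeFs.CapacityBytes + extra.Stats.CapacityBytes,
		UsedBytes:     kubeletNodeFs.UsedBytes + extra.Stats.UsedBytes,
	}
}
```

The device comparison is what prevents double counting when the storage directory is simply a folder on the root disk.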
We can most definitely simplify the flag. My main reasoning was to provide an easy way to override the auto detection decision. If we say it is not worth it, then let's remove it 😅
One reason we might want to keep a way of fetching those metrics that does not call the kubelet endpoint is that requiring `nodes/proxy` GET effectively lets us exec into pods. Some customers might not be too happy about this, but overall it is not the end of the world.