
feat: report ephemeral storage sizes of node as custom metric #644

Open
patrickpichler wants to merge 1 commit into main from report-node-ephemeral-storage-metrics

@patrickpichler patrickpichler commented Feb 20, 2026

Kvisor can now determine how much ephemeral storage is available and used per node. The data is reported as a new field on the storage_node_metrics metric. A new ephemeral-storage-source argument to kvisor-agent configures how the storage metrics are collected. There are currently four modes available:

  • none - no ephemeral storage metrics are ingested
  • kubelet - use the data provided by the kubelet to fill the metric
  • storage-optimization - for nodes where CAST.AI storage-optimization is present; the disk behind /var/lib/castai-storage is taken into account as well
  • auto - automatically detects whether storage-optimization is active and configures the mode accordingly

The logic for storage-optimization is far from perfect: it currently always sums the available disk space of the volume holding /var/lib/kubelet and the volume holding /var/lib/castai-storage.
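
To make the summing concrete, here is a minimal Go sketch of adding up per-path disk stats while counting each device only once, as the code comment in the diff describes ("deduplicated by device ID"). The `diskStat` type and `sumByDevice` function are hypothetical names used for illustration only; they are not taken from the kvisor code base, and the sketch assumes the per-path stats have already been collected.

```go
package main

import "fmt"

// diskStat is a hypothetical snapshot of the filesystem behind one path.
// In a real agent the values would come from statfs/stat on that path.
type diskStat struct {
	Path          string
	DeviceID      uint64
	CapacityBytes uint64
	UsedBytes     uint64
}

// sumByDevice adds up capacity and usage, counting every device only once,
// so two paths that live on the same disk are not double-counted.
func sumByDevice(stats []diskStat) (capacity, used uint64) {
	seen := make(map[uint64]struct{}, len(stats))
	for _, s := range stats {
		if _, dup := seen[s.DeviceID]; dup {
			continue
		}
		seen[s.DeviceID] = struct{}{}
		capacity += s.CapacityBytes
		used += s.UsedBytes
	}
	return capacity, used
}

func main() {
	// Illustrative numbers only.
	stats := []diskStat{
		{Path: "/var/lib/kubelet", DeviceID: 1, CapacityBytes: 100 << 30, UsedBytes: 40 << 30},
		{Path: "/var/lib/castai-storage", DeviceID: 2, CapacityBytes: 200 << 30, UsedBytes: 10 << 30},
	}
	capacity, used := sumByDevice(stats)
	fmt.Printf("capacity=%d GiB used=%d GiB\n", capacity>>30, used>>30)
}
```

If both paths were on the same device, only the first entry would contribute to the totals.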

patrickpichler force-pushed the report-node-ephemeral-storage-metrics branch 5 times, most recently from d99c663 to bc1968c, on February 23, 2026 12:32
// csi.cast.ai CSI driver (/var/lib/castai-storage), deduplicated by device ID.
// This avoids double-counting PVC-backed mounts that reside under /var/lib/kubelet.
EphemeralStorageSourceStorageOptimization EphemeralStorageSource = "storage-optimization"
// EphemeralStorageSourceAuto probes for the presence of /var/lib/castai-storage

Contributor

// EphemeralStorageSourceAuto probes for the presence of /var/lib/castai-storage
// at each collection cycle

This comment is most likely incorrect.

EphemeralStorageSourceAuto probes for the presence of /var/lib/castai-storage once during startup and then keeps using the selected EphemeralStorageSource, right?
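
The behavior described here (probe once at startup, then stick with the result) could look roughly like the Go sketch below. Only `EphemeralStorageSourceAuto` and `EphemeralStorageSourceStorageOptimization` appear in the diff; the `None` and `Kubelet` constant names, the `resolveSource` helper, and the fallback to `kubelet` when the directory is missing are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
)

type EphemeralStorageSource string

const (
	// The None and Kubelet constant names are assumed; the diff only shows
	// the StorageOptimization and Auto constants.
	EphemeralStorageSourceNone                EphemeralStorageSource = "none"
	EphemeralStorageSourceKubelet             EphemeralStorageSource = "kubelet"
	EphemeralStorageSourceStorageOptimization EphemeralStorageSource = "storage-optimization"
	EphemeralStorageSourceAuto                EphemeralStorageSource = "auto"
)

const castaiStorageDir = "/var/lib/castai-storage"

// resolveSource turns "auto" into a concrete source. It is meant to be
// called once at startup; the result is then reused for every collection
// cycle. Falling back to kubelet is an assumption of this sketch.
func resolveSource(configured EphemeralStorageSource, dirExists func(string) bool) EphemeralStorageSource {
	if configured != EphemeralStorageSourceAuto {
		return configured
	}
	if dirExists(castaiStorageDir) {
		return EphemeralStorageSourceStorageOptimization
	}
	return EphemeralStorageSourceKubelet
}

// dirExists reports whether path exists and is a directory.
func dirExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.IsDir()
}

func main() {
	// Resolved once; a collector would keep using this value afterwards.
	source := resolveSource(EphemeralStorageSourceAuto, dirExists)
	fmt.Println("resolved source:", source)
}
```

Passing the probe as a function keeps the decision logic testable without touching the real filesystem.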

Contributor Author

Good catch! The comment has been updated.

patrickpichler force-pushed the report-node-ephemeral-storage-metrics branch from bc1968c to b474f1b on February 23, 2026 14:01
// sum of the disk hosting /var/lib/kubelet and the disk backing the
// csi.cast.ai CSI driver (/var/lib/castai-storage), deduplicated by device ID.
// This avoids double-counting PVC-backed mounts that reside under /var/lib/kubelet.
EphemeralStorageSourceStorageOptimization EphemeralStorageSource = "storage-optimization"

Contributor

I am wondering whether we could simplify the --ephemeral-storage-source flag.

It currently has 4 possible values, which may be hard to maintain in the future.

Instead, would a single auto strategy be enough? The logic could be:

  1. Always take the base ephemeral storage from the kubelet API (node.fs) — this is the standard Kubernetes-reported value.
  2. Probe for /var/lib/castai-storage — if it exists and is on a different device than the root filesystem, add its capacity/usage on top.

This way we always report correct ephemeral storage: standard kubelet numbers when there's no CAST AI storage solution, and kubelet + additional disk when there is.
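
As a rough illustration of the two steps proposed above, the Go sketch below starts from the kubelet-reported node filesystem numbers and adds a second disk only when it exists and sits on a different device. The `fsUsage` type and `combine` function are hypothetical; the sketch also assumes the device ID of the root filesystem is obtained separately, since it is not part of the kubelet-reported values.

```go
package main

import "fmt"

// fsUsage is a hypothetical container for one filesystem's numbers.
type fsUsage struct {
	DeviceID      uint64
	CapacityBytes uint64
	UsedBytes     uint64
}

// combine implements the proposed strategy:
//  1. take the base values reported by the kubelet (node.fs),
//  2. add the extra disk only if it exists (non-nil) and is on a
//     different device than the root filesystem.
func combine(kubeletFs fsUsage, extra *fsUsage) (capacity, used uint64) {
	capacity, used = kubeletFs.CapacityBytes, kubeletFs.UsedBytes
	if extra != nil && extra.DeviceID != kubeletFs.DeviceID {
		capacity += extra.CapacityBytes
		used += extra.UsedBytes
	}
	return capacity, used
}

func main() {
	// Illustrative numbers only.
	root := fsUsage{DeviceID: 1, CapacityBytes: 100 << 30, UsedBytes: 40 << 30}
	castai := &fsUsage{DeviceID: 2, CapacityBytes: 200 << 30, UsedBytes: 10 << 30}

	capacity, used := combine(root, castai)
	fmt.Printf("with extra disk: capacity=%d GiB used=%d GiB\n", capacity>>30, used>>30)

	capacity, used = combine(root, nil)
	fmt.Printf("kubelet only:    capacity=%d GiB used=%d GiB\n", capacity>>30, used>>30)
}
```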

Contributor Author

We can most definitely simplify the flag. My main reasoning was to provide an easy way to override the auto detection decision. If we say it is not worth it, then let's remove it 😅

One reason to keep a way of fetching those metrics other than calling the kubelet endpoint is that requiring node/proxy GET effectively allows us to exec into pods. Some customers might not be too happy about this, but overall it is not the end of the world.
