
Conversation

@mariash (Member) commented Jan 8, 2026

This PR adds the RFC "Dynamic Disks Support in BOSH".

For easier viewing, you can see the full RFC as a rendered preview.

@mariash force-pushed the dynamic-disks branch 2 times, most recently from edb5daa to 83fa9cd, on January 8, 2026 18:26
@beyhan requested review from a team, Gerg, beyhan, cweibel, rkoster and stephanme, and removed the request for a team, on January 9, 2026 10:18
@beyhan added the rfc (CFF community RFC) and toc labels on Jan 9, 2026
@rkoster (Contributor) commented Jan 9, 2026

Architectural Concerns: Runtime Dependencies, Security Model, and Adoption Path

This RFC proposes significant shifts in BOSH's operational and security model that need to be addressed before proceeding with API design.


1. BOSH Director in the Runtime Critical Path

Current State:
The BOSH Director functions purely as a control plane. Once workloads are deployed, the Director can be completely unavailable without impacting running applications. Cloud Foundry, Kubernetes clusters, and other BOSH-deployed systems continue operating normally during Director downtime.

Proposed State:
This RFC introduces runtime dependencies on the Director. Workloads that need to dynamically provision or attach disks will fail if the Director is unavailable. This expands the blast radius of Director outages from "can't deploy/scale" to "application-level failures."

Blocker: There is no open-source high-availability solution for BOSH Director. The Director is typically deployed as a single instance and is not designed for the uptime guarantees that runtime-critical infrastructure requires. This RFC should be contingent on an HA Director solution existing first, unless the design is changed such that the Director is no longer in the critical path of workload operations.


2. Security Architecture Change

Current State:
Workload clusters (e.g., Kubernetes) do not need network access to the BOSH Director API, nor do they require Director credentials. The Director is an infrastructure concern, isolated from the workloads it manages.

Proposed State:
This RFC requires workload clusters to:

  • Have network connectivity to the Director API
  • Hold credentials authorized to call Director endpoints
  • Make authenticated API calls at runtime

This is a substantial change to the security boundary. Workloads now become potential attack vectors against the Director. A compromised Kubernetes cluster could potentially:

  • Manipulate disks across deployments (depending on authorization scope)
  • Launch a denial-of-service attack against the Director API
  • Leverage Director credentials for lateral movement

The RFC's authorization model is underspecified - it mentions "authorized clients" but doesn't detail how credentials are scoped, rotated, or isolated per workload.


3. No OSS Consumer

There is no open-source BOSH release identified as an adopter of this feature. For the community to take on the additional complexity this RFC introduces - both in the Director codebase and in operational requirements - there should be a concrete plan for at least one OSS BOSH release to adopt dynamic disks. Without this, the feature adds maintenance burden with no clear benefit to the community.

@metskem (Contributor) commented Jan 9, 2026

I agree with Ruben's statements.
The RFC introduces a dangerous dependency on the BOSH Director, and making the BOSH Director highly available would introduce a lot of extra complexity.
The requested functionality (disk provisioning) looks more like something to be solved with a Service Broker.

@mariash (Member, Author) commented Jan 9, 2026

@rkoster @metskem to address the concerns:

  1. First of all, this feature is completely opt-in. By default, no new API is exposed and no additional jobs are scheduled or executed; existing BOSH behavior remains unchanged. Disk management jobs are executed by separate Director workers, and the number of these workers is configurable and defaults to 0.
  2. On your first point about Director availability and runtime dependency, how this feature is used depends on the consumer implementation. For example, Kubernetes CSI controllers are built around a reconcile-and-retry model (see the sketch after this list). If the Director API is temporarily unavailable, new operations will be delayed, but the system will converge to the desired state once the Director returns. While having an HA Director would be ideal, the lack of one should not block this opt-in feature with eventual-consistency semantics.
  3. On the security considerations, consumers of the disk management API, such as the Disk Provisioner, can be deployed in a secure and isolated way. They do not need to run alongside workloads and can be deployed in a restricted environment or even co-located with the BOSH Director on the same VM.
  4. There is currently no OSS consumer for this. This feature enables future Kubernetes integration and strengthens BOSH's role as a single control plane for IaaS resources, with disk management integrated into the VM lifecycle and protected from conflicting VM operations. Historically, Cloud Foundry and BOSH have accepted opt-in features that served the needs of specific community members or commercial use cases, provided they did not change default behavior and the maintainers were willing to support them. We are prepared to take responsibility for the ongoing maintenance of this feature. Keeping it in BOSH allows future orchestrators to benefit from a shared implementation rather than encouraging private solutions outside the project.
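
To illustrate the reconcile-and-retry behavior that point 2 relies on, here is a minimal Go sketch of how a CSI-style controller typically handles a temporarily unavailable control plane. The `directorClient` interface and its `CreateDisk` method are hypothetical stand-ins, not part of the RFC's proposed API:

```go
package diskprovision

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// directorClient is a stand-in for whatever client a consumer of the proposed
// disk API would use; CreateDisk is hypothetical and only illustrates the
// interaction shape.
type directorClient interface {
	CreateDisk(ctx context.Context, deployment, instanceID string, sizeMB int) (diskCID string, err error)
}

// provisionWithRetry keeps retrying with exponential backoff until the
// Director answers or the caller gives up. A real CSI controller gets this
// behavior from its reconcile loop: a failed CreateVolume call is simply
// requeued and retried later, so a Director outage delays rather than breaks
// the operation.
func provisionWithRetry(ctx context.Context, d directorClient, deployment, instanceID string, sizeMB int) (string, error) {
	backoff := time.Second
	for {
		cid, err := d.CreateDisk(ctx, deployment, instanceID, sizeMB)
		if err == nil {
			return cid, nil
		}
		fmt.Printf("CreateDisk failed (%v), retrying in %s\n", err, backoff)
		select {
		case <-ctx.Done():
			return "", errors.Join(ctx.Err(), err)
		case <-time.After(backoff):
		}
		if backoff < 2*time.Minute {
			backoff *= 2
		}
	}
}
```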

@rkoster (Contributor) commented Jan 10, 2026

@mariash Thank you for your response. Could you provide some recent examples (links to PRs or commits) to support:

Historically, Cloud Foundry and BOSH have accepted opt-in features that served the needs of specific community members or commercial use cases.

@mariash (Member, Author) commented Jan 12, 2026

@rkoster here is a recent example: cloudfoundry/routing-release#451

@mariash (Member, Author) commented Jan 12, 2026

I updated the proposal with a potential BOSH CSI implementation. That CSI should be straightforward to implement and would be beneficial for the BOSH community. We would like to keep the BOSH changes upstream if possible.

@Alphasite (Contributor) commented:

I wanted to call out what I see as the positive security implications here. In practice, the alternative to this feature isn’t “no dynamic disks” — it’s implementing dynamic disks outside BOSH, which tends to push cloud permissions into the workload/tenant boundary.

That means we move from “IaaS privileges live in the Director (which is already the trusted component for IaaS access)” to “each workload environment needs some form of IaaS permission,” whether that’s static keys, instance roles, or workload identity. Even with the more modern options, you’re still granting cloud-level capabilities inside a boundary that’s harder to secure and audit consistently.

I’ve seen this play out in PKS-era Kubernetes (and its OSS KUBO release) on BOSH using cloud CSI plugins: it worked, but every cluster needed cloud permissions to provision/attach volumes. That increased blast radius — compromising the cluster control plane could translate into cloud-level capabilities — and it increased operational burden because we had to ship and patch five CSIs (one for each IaaS) in the tenant environment.

This design centralizes privileged disk operations back into the Director and enables narrowly scoped UAA access for consumers (e.g., disk operations only), which reduces credential sprawl and reduces the number of places where cloud-level privileges exist. Compromising a cluster no longer immediately yields IaaS privileges.
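
For illustration, narrowly scoped consumer access of the kind described above could look roughly like the following Go sketch using the standard OAuth2 client-credentials flow. The client name, the scope `bosh.disks.manage`, and the UAA URL are made-up examples, not names defined by the RFC:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"golang.org/x/oauth2/clientcredentials"
)

func main() {
	// A UAA client dedicated to disk operations: it holds no other Director
	// authorities, so compromising this consumer does not yield general IaaS
	// or deployment privileges. Scope name and URLs are hypothetical.
	cfg := clientcredentials.Config{
		ClientID:     "disk-provisioner",
		ClientSecret: "example-secret",
		TokenURL:     "https://uaa.bosh.internal:8443/oauth/token",
		Scopes:       []string{"bosh.disks.manage"},
	}

	token, err := cfg.Token(context.Background())
	if err != nil {
		log.Fatalf("fetching scoped token: %v", err)
	}
	// The token would then be presented only to the Director's disk endpoints.
	fmt.Println("token expires at:", token.Expiry)
}
```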

This isn’t risk-free: it expands the Director API surface and makes availability important for new disk operations, so it needs guardrails (tight scopes, per-deployment credentials, network restrictions, and strong auditing). But compared to the current workarounds, I think this is a net improvement in least privilege and operational security.

More broadly, I view this as a foundational primitive: not valuable on its own, but an enabling capability that makes it significantly easier for anyone to build stateful workloads on top of BOSH without reinventing (and re-securing) a parallel disk-control plane in every deployment.

@rkoster (Contributor) commented Jan 13, 2026

@mariash Thank you for providing context on the routing-release example. However, I don't think routing-release#451 (cloudfoundry/routing-release#451) supports the precedent you're citing. That issue was a straightforward library choice (whether to use go-metric-registry vs Prometheus directly) — a minor implementation decision with no architectural implications, no new API surface, and no changes to operational or security models.
What would demonstrate the claimed precedent is an example where the community accepted an opt-in feature that:

  • Introduced significant new capabilities with commercial/specific use-case motivation
  • Added maintenance burden with commitment from maintainers to support it
  • Had comparable scope: new APIs, security considerations, or operational model changes

Separately, I'd like to raise a concern about the structure of this RFC. It appears to combine two distinct proposals:

  1. Dynamic Disk Lifecycle Management
     The ability to create, attach, detach, move, and delete persistent disks on BOSH-managed VMs with more flexibility than currently exists.
  2. Out-of-Band Runtime API
     A new /dynamic_disks/* API that allows external systems to provision disks at runtime, bypassing the deployment manifest.

These have very different implications. Proposal 1 could potentially fit within BOSH's existing model — disk operations triggered through bosh deploy with the manifest remaining the source of truth. Proposal 2 is what introduces the architectural concerns we've raised: runtime dependency on the Director, credentials in workload boundaries, and state that exists outside the manifest.

Question: Can the dynamic disk lifecycle capabilities be implemented without the out-of-band API? For example, if a CSI controller needs to provision a disk, could it update the deployment manifest and trigger a deploy rather than calling a runtime API directly?

If these proposals can be decoupled, I'd suggest splitting the RFC. The community could evaluate the disk lifecycle improvements on their own merits, separate from the more contentious question of whether BOSH should move away from the "everything is a bosh deploy" paradigm.
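
As a rough illustration of the manifest-driven alternative raised in the question above, the following Go sketch shells out to the bosh CLI (`bosh manifest` and `bosh deploy` are existing commands); `addDiskToManifest` and the overall flow are hypothetical and only meant to show the interaction shape, not a proposed design:

```go
package manifestflow

import (
	"fmt"
	"os"
	"os/exec"
)

// addDiskToManifest would append a persistent disk entry to the given instance
// group in the downloaded manifest. Hypothetical helper; the YAML manipulation
// is elided to keep the sketch short.
func addDiskToManifest(manifest []byte, instanceGroup, diskName string, sizeMB int) []byte {
	// ... YAML manipulation elided ...
	return manifest
}

// provisionViaDeploy is the manifest-as-source-of-truth alternative to a
// runtime /dynamic_disks/* call: fetch the manifest, add the disk, redeploy.
func provisionViaDeploy(deployment, instanceGroup, diskName string, sizeMB int) error {
	manifest, err := exec.Command("bosh", "-d", deployment, "manifest").Output()
	if err != nil {
		return fmt.Errorf("downloading manifest: %w", err)
	}

	updated := addDiskToManifest(manifest, instanceGroup, diskName, sizeMB)
	if err := os.WriteFile("updated-manifest.yml", updated, 0o600); err != nil {
		return err
	}

	// The deploy is what actually creates and attaches the disk, so disk
	// provisioning latency becomes deploy latency rather than an API call.
	cmd := exec.Command("bosh", "-n", "-d", deployment, "deploy", "updated-manifest.yml")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}
```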

@rkoster (Contributor) commented Jan 13, 2026

@Alphasite I agree with your framing of this as a foundational primitive, and I share your concern about IaaS credentials sprawling into workload boundaries. Centralizing disk operations is a sound architectural goal.
However, I don't think BOSH itself needs to implement the full runtime API to achieve this. The primitive that BOSH should provide is dynamic disk lifecycle management — the ability to create, attach, detach, move, and delete disks on VMs. This could remain within BOSH's existing model, triggered through bosh deploy with the manifest as source of truth.
The runtime orchestration layer (the thing that decides when to provision disks and calls the appropriate APIs) could be a separate component that consumes BOSH's disk primitives. This component — whether it's a CSI controller, a dedicated disk provisioner, or something else — would be responsible for the reconcile/retry logic, credential management, and integration with workload schedulers.
This separation would:

  • Keep BOSH focused on IaaS abstraction and VM lifecycle (what it's good at)
  • Avoid introducing runtime dependencies on the Director
  • Allow the orchestration layer to evolve independently (different implementations for different use cases)
  • Still achieve the security benefits you described (IaaS credentials stay in trusted infrastructure, not workloads)
