Conversation

@aramprice (Member) commented Jan 8, 2026

Part of: #2655

Local diff for fly'ing the bosh-director pipeline to consume this branch:

diff --git a/ci/pipeline.yml b/ci/pipeline.yml
index 19a38a9bd4..3f84be54ff 100644
--- a/ci/pipeline.yml
+++ b/ci/pipeline.yml
@@ -1,6 +1,6 @@
 ---
 anchors:
-  - &branch_name main
+  - &branch_name noble-cut-over
   - &brats_env_name brats-bosh-main
   - &ubuntu_base ubuntu:noble
   - &integration_image ghcr.io/cloudfoundry/bosh/integration

Known issues

There are three failing jobs in CI; two are blocking and one is not:

Neither the docker-cpi nor the warden-cpi can boot a "VM" (container image) on the existing Concourse workers, which are based on jammy stemcells: Noble stemcells rely (at least) on Linux cgroups v2 for the "protect monit" behavior, and cgroups v2 is not present in jammy-based stemcells.
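A quick way to confirm which cgroup hierarchy a given worker or stemcell exposes (a general check, not something taken from the pipeline):

# Report which cgroup hierarchy this host exposes; Noble-based agents expect the
# unified (v2) hierarchy, which shows up as filesystem type "cgroup2fs".
if [ "$(stat -fc %T /sys/fs/cgroup)" = "cgroup2fs" ]; then
  echo "cgroup v2 (unified hierarchy)"
else
  echo "cgroup v1 (legacy or hybrid hierarchy)"
fi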

The fix that allows the warden-cpi to successfully boot an agent is in ci/dockerfiles/warden-cpi/start-bosh.sh:7-9; see the comment there for context on making warden boot containers with systemd as PID 1. It is based on this bosh-warden-cpi-release commit, thanks @jpalermo 🎉.
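For illustration only, that change amounts to pointing garden at systemd as the container init process. The config directory and the "init-bin" key name below are assumptions; the authoritative lines are start-bosh.sh:7-9 and the referenced bosh-warden-cpi-release commit.

# Illustrative sketch only -- not the actual start-bosh.sh lines.
# The directory and the "init-bin" key name are assumptions; take the real values
# from the referenced bosh-warden-cpi-release commit.
GARDEN_CONFIG=/var/vcap/jobs/garden/config/garden.ini   # assumed location of garden.ini
sed -i 's|^\(init-bin *=\).*|\1 /lib/systemd/systemd|' "${GARDEN_CONFIG}"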

Todos

Make the following CI jobs pass

- replace ruby with bash
- fail if more than one IP addr is found for "OUTER_CONTAINER_IP" (see the sketch below these notes)
- sundry bash lint cleanup
- rename some job
- remove (ubuntu-)jammy references
- replace jammy with noble
- replace concourse `var` with YAML anchor(s)
- address ruby style / deprecation issues
- noble behavior is already the bosh-deployment default
- docker-cpi image has `docker-env` for debug use
- docker-cpi cleanup `start-bosh` script
This is required because the FIPS stemcell is currently Jammy and the compiled release is for Noble. Based on this commit[1] we are mutating the in-container garden.ini setting to make Warden use systemd instead of the garden-provided init binary.

[1] cloudfoundry/bosh-warden-cpi-release@434738f#diff-f3d9c00d365d08274b8e73e1dc4fc4b2d38a92a654d4d2b27f4ffdc01730576bR1-R8
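Sketch of the single-IP guard mentioned in the todo list above; how the address is selected here (global-scope IPv4 across all interfaces) is an assumption, not necessarily what the script does:

# Minimal sketch of the "exactly one IP" guard; the address-selection logic here
# is an assumption, not the script's actual implementation.
OUTER_CONTAINER_IP="$(ip -4 -o addr show scope global | awk '{ split($4, a, "/"); print a[1] }')"
ip_count="$(printf '%s\n' "${OUTER_CONTAINER_IP}" | grep -c . || true)"
if [ "${ip_count}" -ne 1 ]; then
  echo "Expected exactly one IP for OUTER_CONTAINER_IP, found ${ip_count}: ${OUTER_CONTAINER_IP}" >&2
  exit 1
fi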
@aramprice (Member Author)

One notable help in debugging was installing the gaol CLI for warden. This might be something that should be installed in the warden-cpi OCI container.
Refs:

cc @mkocher who may be able to provide more context

@aramprice (Member Author)

A quick scan of Concourse shows that there have been problems with cgroups v2[1] in the past. These may have been resolved. Leaving this link here as a breadcrumb in case there are issues running a Concourse worker on an ubuntu-noble stemcell.

[1] concourse/concourse#5080

@aramprice (Member Author) commented Jan 8, 2026

Manually attempting to deploy a single noble-based worker tagged with noble:

bosh deploy diff

bosh -d concourse deploy concourse.yml
Using environment '10.0.0.3' as client 'admin'

Using deployment 'concourse'

  stemcells:
+ - alias: noble
+   os: ubuntu-noble
+   version: '1.188'

  addons:
+ - include:
+     stemcell:
+     - os: ubuntu-noble
+   jobs:
+   - name: bosh-dns
+     properties:
+       api:
+         client:
+           tls: "<redacted>"
+         server:
+           tls: "<redacted>"
+       cache:
+         enabled: "<redacted>"
+       configure_systemd_resolved: "<redacted>"
+       disable_recursors: "<redacted>"
+       health:
+         client:
+           tls: "<redacted>"
+         enabled: "<redacted>"
+         server:
+           tls: "<redacted>"
+       override_nameserver: "<redacted>"
+     release: bosh-dns
+   name: bosh-dns-systemd

  instance_groups:
+ - azs:
+   - az1
+   instances: 1
+   jobs:
+   - name: worker
+     properties:
+       baggageclaim:
+         driver: "<redacted>"
+       drain_timeout: "<redacted>"
+       garden:
+         max_containers: "<redacted>"
+       tags:
+       - "<redacted>"
+       worker_gateway:
+         worker_key: "<redacted>"
+     release: concourse
+   name: worker-noble
+   networks:
+   - name: default
+   stemcell: noble
+   update:
+     canaries: 0
+     max_in_flight: 50%
+   vm_type: concourse_worker_16_64

The diff above is against the concourse deployment manifest which can be acquired via:

bosh -d concourse manifest > concourse.yml

The current FIWG concourse deployment now has a single noble-based Concourse worker. This change is ephemeral and will be removed when the periodic redeploy happens.
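To confirm the tagged worker actually registered, something like the following should work (the fly target name is an assumption for this environment):

# Verify the noble-tagged worker registered with the web node.
# "bosh" as the fly target name is an assumption.
fly -t bosh workers | grep noble

# Jobs/steps that should run on this worker then opt in via Concourse step tags, e.g.:
#   tags: [noble]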

@aramprice (Member Author)

Presumably the sanitize_cgroups() function in docker-cpi/start-bosh.sh will need to be updated for cgroups v2.

Build running docker-cpi on Noble concourse worker: https://bosh.ci.cloudfoundry.org/teams/main/pipelines/bosh-director/jobs/brats-acceptance/builds/459
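A minimal sketch of what a cgroups-v2-aware version of that setup could look like; this is not the current sanitize_cgroups() implementation, just the general shape under v2:

# Minimal sketch only, not the actual sanitize_cgroups(): on a cgroup v2 host the
# per-controller v1 mounts (cpuset, memory, devices, ...) no longer exist, so a
# single unified mount is all the setup that is needed.
if [ "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)" != "cgroup2fs" ]; then
  mkdir -p /sys/fs/cgroup
  mount -t cgroup2 none /sys/fs/cgroup
fi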

@aramprice (Member Author) commented Jan 9, 2026

The warden-cpi image's start-bosh script fails to work on a Concourse worker running on a noble stemcell.


Possible root error message from /var/vcap/sys/log/garden/garden_ctl.stderr.log:

/var/vcap/packages/greenskeeper/bin/system-preparation: line 64: /sys/kernel/mm/transparent_hugepage/enabled: Read-only file system
Could not disable automatic transparent hugepage allocation. This is normal in bosh lite.
{"timestamp":"2026-01-08T23:27:02.443504992Z","level":"error","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.mounting-fs-failed","data":{"allOpts":"loop,pquota,noatime","destination":"/var/vcap/data/grootfs/store/unprivileged","error":"exit status 32","session":"1.1","source":"/var/vcap/data/grootfs/store/unprivileged.backing-store","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":683174301696},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"2026-01-08T23:27:02.443725452Z","level":"info","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.existing-backing-store-could-not-be-mounted: Mounting filesystem: exit status 32: mount: /var/vcap/data/grootfs/store/unprivileged: failed to setup loop device for /var/vcap/data/grootfs/store/unprivileged.backing-store.\n","data":{"session":"1.1","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":683174301696},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"2026-01-08T23:27:02.446366102Z","level":"error","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.overlayxfs-init-filesystem.mounting-fs-failed","data":{"allOpts":"remount,loop,pquota,noatime","destination":"/var/vcap/data/grootfs/store/unprivileged","error":"exit status 32","filesystemPath":"/var/vcap/data/grootfs/store/unprivileged.backing-store","session":"1.1.2","source":"/var/vcap/data/grootfs/store/unprivileged.backing-store","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":683174301696},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"2026-01-08T23:27:03.511723386Z","level":"error","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.overlayxfs-init-filesystem.mounting-fs-failed","data":{"allOpts":"loop,pquota,noatime","destination":"/var/vcap/data/grootfs/store/unprivileged","error":"exit status 32","filesystemPath":"/var/vcap/data/grootfs/store/unprivileged.backing-store","session":"1.1.2","source":"/var/vcap/data/grootfs/store/unprivileged.backing-store","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":683174301696},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"2026-01-08T23:27:03.511807636Z","level":"error","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.initializing-filesystem-failed","data":{"backingstoreFile":"/var/vcap/data/grootfs/store/unprivileged.backing-store","error":"Mounting filesystem: exit status 32: mount: /var/vcap/data/grootfs/store/unprivileged: failed to setup loop device for /var/vcap/data/grootfs/store/unprivileged.backing-store.\n","session":"1.1","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":683174301696},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"2026-01-08T23:27:03.511850176Z","level":"error","source":"grootfs","message":"grootfs.init-store.init-store-failed","data":{"error":"initializing filesyztem: Mounting filesystem: exit status 32: mount: /var/vcap/data/grootfs/store/unprivileged: failed to setup loop device for /var/vcap/data/grootfs/store/unprivileged.backing-store.\n","session":"1"}}

We think this corresponds to the following invocation chain, starting with start-bosh in the container running on Concourse (a noble stemcell in this case):

The start-bosh script:
https://github.com/cloudfoundry/bosh/blob/noble-cut-over/ci/dockerfiles/warden-cpi/start-bosh.sh#L16

This calls garden_ctl start, which becomes garden_start, which is rendered into the warden-cpi container via install-garden.rb in this repo:
https://github.com/cloudfoundry/bosh/blob/noble-cut-over/ci/dockerfiles/warden-cpi/Dockerfile#L49

The garden_start script is rendered from the garden-runc-release template
https://github.com/cloudfoundry/garden-runc-release/blob/develop/jobs/garden/templates/bin/garden_start.erb

Which eventually invokes the script overlay-xfs-setup
https://github.com/cloudfoundry/garden-runc-release/blob/develop/jobs/garden/templates/bin/garden_start.erb#L35

And this script then invokes groot, which fails to set up the backing store for the garden server
https://github.com/cloudfoundry/garden-runc-release/blob/develop/jobs/garden/templates/bin/overlay-xfs-setup#L36-L40

Then within groot:

It seems possible that the permit_device_control method in garden-runc-release/src/greenskeeper/scripts/system-preparation may need to be changed for cgroups v2, similar to this recent change in a related script.
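For context on why that function would break: the usual cgroup v1 approach writes device rules into the devices controller, which has no file interface under cgroup v2 (device access is mediated by BPF programs instead), so an unguarded write fails. An illustrative guard, not the greenskeeper code:

# Illustrative only -- not the actual system-preparation code. Under cgroup v1 the
# devices controller is opened up by writing to devices.allow; under cgroup v2 that
# file does not exist and device filtering is done with BPF (BPF_CGROUP_DEVICE).
if [ -d /sys/fs/cgroup/devices ]; then
  devices_subdir="$(grep devices /proc/self/cgroup | cut -d: -f3)"
  echo 'b 7:* rwm' > "/sys/fs/cgroup/devices${devices_subdir}/devices.allow"  # allow loop block devices (major 7)
else
  echo "cgroup v2 detected: no devices controller to write to; skipping" >&2
fi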

We believe that groot is failing when running the following command:

mount -o remount,loop,pquota,noatime -t xfs /var/vcap/data/grootfs/store/unprivileged.backing-store /var/vcap/data/grootfs/store/unprivileged

From the execution of this groot command:
https://github.com/cloudfoundry/garden-runc-release/blob/develop/jobs/garden/templates/bin/overlay-xfs-setup#L36-L40

... and we are not sure why

Ref: https://bosh.ci.cloudfoundry.org/teams/main/pipelines/bosh-director/jobs/bosh-disaster-recovery-acceptance-tests/builds/418
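Some quick diagnostics from inside the failing container, given that mount(8) reports it cannot set up a loop device (generic checks, not taken from the CI scripts):

# Generic loop-device checks from inside the warden-cpi container:
ls -l /dev/loop* || echo "no loop device nodes visible in this container"
losetup -f || echo "cannot allocate a free loop device (missing nodes or denied device access)"
losetup -a   # loop devices already attached on the shared host kernel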

Set the env var DEBUG to any non-empty value and both bash and bosh
debug logging will be enabled.
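The shape of that toggle is roughly the following (a sketch of the pattern, not the exact lines in the script):

# Sketch of the DEBUG toggle described above, not the exact script lines.
if [ -n "${DEBUG:-}" ]; then
  set -x                        # bash trace logging
  export BOSH_LOG_LEVEL=debug   # bosh CLI debug logging
fi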
@linux-foundation-easycla (bot) commented Jan 10, 2026

CLA Missing ID CLA Not Signed

@aramprice (Member Author)

@rkoster / @beyhan - do either of you understand why the @cf-rabbit-bot and CI Bot commits are being flagged by the CLA checker?

…64 -> verify-multidigest/verify-multidigest-0.0.580-linux-amd64
@beyhan (Member) commented Jan 13, 2026

Hi @aramprice,

Usually @christopherclark helps us register bots with EasyCLA. Did we start using those bots only recently, so that they now need EasyCLA?

@aramprice (Member Author) commented Jan 13, 2026

@beyhan - it might be because CI is currently running on this PR's branch (noble-cut-over)? As far as I know none of these bots are new.

There is a CI failure in the bbl pipeline which also appears to be related to the CI bot:

Cloning into '/tmp/semver-git-repo'...
Warning: Permanently added 'github.com' (ED25519) to the list of known hosts.
HEAD is now at 19a10dd bump to 9.0.42
ERROR: Permission to cloudfoundry/bosh-bootloader.git denied to cf-bosh-ci-bot.
fatal: Could not read from remote repository.

Perhaps something has changed with Bots in general?
