Description
When launching containers via Pyxis + Enroot in a Slinky/Kubernetes-managed Slurm cluster,
we encountered three issues:
1) Multi-layer images (e.g., NVCR Triton) fail during whiteout conversion
Even after setting:
ENROOT_CACHE_PATH=/enroot-tmp/cache
ENROOT_DATA_PATH=/enroot-tmp/data
ENROOT_RUNTIME_PATH=/enroot-tmp/run
ENROOT_TEMP_PATH=/enroot-tmp/tmp
in enroot.conf (/enroot-tmp is an emptyDir backed by node-local disk),
multi-layer images fail during AUFS → overlayfs whiteout conversion if Enroot internally uses /tmp:
enroot-aufs2ovlfs: failed to create opaque ovlfs whiteout:
/tmp/enroot.<id>/17/usr/src/python3.12/: Not supported
/tmp in the Slurm worker pods is an overlayfs/tmpfs, which cannot store overlay whiteouts/xattrs.
Simple images like alpine/ubuntu succeed; NVCR-based multi-layer images fail.
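For reference, a quick way to see the mismatch from inside the worker pod (a sketch assuming the paths above): /tmp reports overlay or tmpfs, while the emptyDir-backed /enroot-tmp reports a regular disk filesystem.
# stat -f prints the filesystem type; overlay/tmpfs cannot hold overlayfs
# whiteouts/xattrs, while the node-disk emptyDir can.
stat -f -c 'FS of /tmp:        %T' /tmp
stat -f -c 'FS of /enroot-tmp: %T' /enroot-tmp
# Import with the configured temp dir to confirm whether it is actually honored:
ENROOT_TEMP_PATH=/enroot-tmp/tmp enroot import docker://<multi-layer-image>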
2) PMIx hook (50-slurm-pmi.sh) fails on non-MPI jobs
Slurm only creates PMIx directories when --mpi=pmix is used:
/var/spool/slurmd/pmix.<jobid>.<stepid>
/tmp/spmix_appdir_<uid>_<jobid>.<stepid>
But Pyxis’s PMIx hook always attempts to bind-mount these directories:
enroot-mount: failed to mount: /tmp/spmix_appdir_*: No such file or directory
→ All non-MPI jobs fail unless the PMIx hook is disabled.
With --mpi=pmix, PMIx works correctly.
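As a temporary workaround one could guard the hook; a minimal sketch (the SLURM_MPI_TYPE check and its placement at the top of 50-slurm-pmi.sh are assumptions on my side, not the shipped script):
# Hypothetical guard at the top of 50-slurm-pmi.sh: skip the PMIx bind-mounts
# when the step was not launched with --mpi=pmix and Slurm therefore never
# created the pmix.<jobid>.<stepid> / spmix_appdir directories.
if [ "${SLURM_MPI_TYPE:-}" != "pmix" ] && ! ls -d /tmp/spmix_appdir_* >/dev/null 2>&1; then
    exit 0
fi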
3) NVIDIA hook (98-nvidia.sh) fails — GPU cannot be used at all
When GPU hooks are enabled:
nvidia-container-cli: mount error: mount operation failed:
/enroot-tmp/data/pyxis_<job>/run/nvidia-persistenced/socket: operation not permitted
Causing container startup to fail:
pyxis: couldn't start container
spank_pyxis.so: task_init() failed
If the hook is disabled, containers start — but no GPU is usable:
/dev/nvidia* devices appear, BUT:
- nvidia-smi is not injected
- NVIDIA driver libraries (libcuda.so, libnvidia-ml.so, etc.) are not mounted
- torch.cuda.is_available() → False
Because NVCR images rely on NVIDIA Container Toolkit to inject host GPUs and driver stack,
disabling the hook makes the container functional but CPU-only.
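For completeness, this is the kind of check that shows the CPU-only state inside a container started with the hook disabled (a sketch; the torch check assumes a PyTorch-based NVCR image):
# Inside the container started via Pyxis with 98-nvidia.sh disabled:
ls /dev/nvidia*                                   # device nodes are present
command -v nvidia-smi || echo "nvidia-smi missing"
ldconfig -p | grep -E 'libcuda|libnvidia-ml' || echo "driver libraries not mounted"
python3 -c 'import torch; print(torch.cuda.is_available())'   # prints False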
Steps to Reproduce
Environment
- Slinky-managed Slurm in Kubernetes
- Pyxis + Enroot 4.0.1
- NVIDIA H100 nodes (NVRM 580.95.05, CUDA 13.0)
- GPU plugin + NVIDIA toolkit installed
- ENROOT directories on local NVMe (/enroot-tmp)
- --mpi=pmix tested and verified
(This setup uses my custom Docker images, built from login-pyxis:25.11-ubuntu24.04 and slurmd-pyxis:25.11-ubuntu24.04.)
1) Whiteout failure
srun --container-image <multi-layer-image> bash
2) PMIx hook failure
srun --container-image <any-image> bash # without --mpi=pmix
3) NVIDIA hook failure
srun --gpus 2 --mpi=pmix --container-image <nvcr-image> bash
Expected Behavior
- Enroot should honor ENROOT_TEMP_PATH fully and avoid /tmp on tmpfs/overlayfs when extracting layers.
- The PMIx hook should only run when the job actually uses PMIx (--mpi=pmix, or detected via the environment).
- The NVIDIA hook should succeed in injecting /dev/nvidia* devices, nvidia-smi, the driver libraries (libcuda.so, etc.), and /run/nvidia-persistenced/socket without EPERM errors.
Overall expectation:
GPU-enabled containers should start correctly under Pyxis + Enroot without disabling core hooks.
Additional Context
A) tmpfs/overlayfs whiteout limitations
Pyxis still directs Enroot to use /tmp in some stages of import, even when ENROOT_TEMP_PATH is set.
On Kubernetes worker pods, /tmp is overlayfs → whiteouts fail → multi-layer images cannot be imported.
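One way to confirm this from the node is to trace an import and look for paths under /tmp/enroot.* (a sketch; the syscall filter is illustrative):
# If ENROOT_TEMP_PATH were honored everywhere, no /tmp/enroot.* paths should appear here.
ENROOT_TEMP_PATH=/enroot-tmp/tmp \
  strace -f -e trace=mkdir,mkdirat,openat,mknodat \
  enroot import docker://<multi-layer-image> 2>&1 | grep -o '/tmp/enroot\.[^"]*' | sort -u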
B) PMIx directory creation behavior
Slurm only generates PMIx directories for jobs using --mpi=pmix.
Hook should skip PMIx mounts for non-MPI jobs to avoid failures.
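This is easy to demonstrate from the login node (a sketch, no container involved):
# Without --mpi=pmix Slurm creates no PMIx step directories, so the hook has nothing to mount:
srun bash -c 'ls -d /tmp/spmix_appdir_* 2>/dev/null || echo "no PMIx dirs"'
# With --mpi=pmix the directories exist and the hook's bind-mounts succeed:
srun --mpi=pmix bash -c 'ls -d /tmp/spmix_appdir_*'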
C) NVIDIA hook mount EPERM
Pyxis runs Enroot inside a user namespace; nvidia-container-cli configure requires privileged bind-mounts into the Enroot rootfs.
This fails with EPERM even when the Slurm worker pod is privileged: true.
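To rule out missing privileges on the pod itself, a few checks run on the worker pod (a sketch; outside any container):
grep CapEff /proc/self/status              # pod-level effective capabilities
ls -l /run/nvidia-persistenced/socket      # the socket the hook tries to bind-mount
findmnt -T /enroot-tmp -no FSTYPE,OPTIONS  # filesystem backing the Enroot rootfs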
D) GPU works only if hook is disabled
Disabling 98-nvidia.sh allows containers to start (including multi-layer NVCR images),
but GPU libraries are missing → CUDA unavailable inside container.
Optional Suggestion: Support OCI backend (rootless docker/podman)
Optional — not required for resolving the bug.
If an OCI backend using rootless Docker or Podman were supported (via oci.conf + AuthType=auth/slurm), several of these issues might be avoided. According to the Slurm documentation, AuthType=auth/munge only supports rootless Docker or Podman.
- OCI runtimes already handle:
- driver injection
- mount permissions
- whiteout semantics
- Would reduce the need for Pyxis to perform privileged mount operations inside user namespaces
- Could avoid tmpfs whiteout problems entirely
- Might eliminate the need for custom Enroot NVIDIA hooks
Not required, but may offer a clean long-term architectural path.
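If Slurm's OCI backend with rootless Podman were used, GPU access could go through CDI instead of the Enroot NVIDIA hook; a sketch of what that looks like outside Slurm (the nvidia-ctk/CDI setup is an assumption about the node, not something Pyxis provides):
# Generate a CDI spec for the host GPUs with the NVIDIA Container Toolkit:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Rootless Podman then handles driver injection, mount permissions and whiteouts itself:
podman run --rm --device nvidia.com/gpu=all <nvcr-image> nvidia-smi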
If more logs (Pyxis debug, Enroot mount trace, NVIDIA Toolkit debug) would be helpful,
I can provide them.