Description
When launching containers via Pyxis + Enroot in a Slinky/Kubernetes-managed Slurm cluster,
we encountered three issues:
1) Multi-layer images (e.g., NVCR Triton) fail during whiteout conversion
Even after setting:
ENROOT_CACHE_PATH=/enroot-tmp/cache
ENROOT_DATA_PATH=/enroot-tmp/data
ENROOT_RUNTIME_PATH=/enroot-tmp/run
ENROOT_TEMP_PATH=/enroot-tmp/tmp
in enroot.conf (/enroot-tmp is an emptyDir backed by node-local disk),
multi-layer images fail during AUFS → overlayfs whiteout conversion if Enroot internally uses /tmp:
enroot-aufs2ovlfs: failed to create opaque ovlfs whiteout:
/tmp/enroot.<id>/17/usr/src/python3.12/: Not supported
/tmp in the Slurm worker pods is an overlayfs/tmpfs, which cannot store overlay whiteouts/xattrs.
Simple images like alpine/ubuntu succeed; NVCR-based multi-layer images fail.
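For reference, a quick way to see the mismatch from inside the worker pod (a sketch assuming the paths above): /tmp reports overlay or tmpfs, while the emptyDir-backed /enroot-tmp reports a regular disk filesystem.
# stat -f prints the filesystem type; overlay/tmpfs cannot hold overlayfs
# whiteouts/xattrs, while the node-disk emptyDir can.
stat -f -c 'FS of /tmp:        %T' /tmp
stat -f -c 'FS of /enroot-tmp: %T' /enroot-tmp
# Import with the configured temp dir to confirm whether it is actually honored:
ENROOT_TEMP_PATH=/enroot-tmp/tmp enroot import docker://<multi-layer-image>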
2) PMIx hook (50-slurm-pmi.sh) fails on non-MPI jobs
Slurm only creates PMIx directories when --mpi=pmix is used:
/var/spool/slurmd/pmix.<jobid>.<stepid>
/tmp/spmix_appdir_<uid>_<jobid>.<stepid>
But Pyxis’s PMIx hook always attempts to bind-mount these directories:
enroot-mount: failed to mount: /tmp/spmix_appdir_*: No such file or directory
→ All non-MPI jobs fail unless the PMIx hook is disabled.
With --mpi=pmix, PMIx works correctly.
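As a temporary workaround one could guard the hook; a minimal sketch (the SLURM_MPI_TYPE check and its placement at the top of 50-slurm-pmi.sh are assumptions on my side, not the shipped script):
# Hypothetical guard at the top of 50-slurm-pmi.sh: skip the PMIx bind-mounts
# when the step was not launched with --mpi=pmix and Slurm therefore never
# created the pmix.<jobid>.<stepid> / spmix_appdir directories.
if [ "${SLURM_MPI_TYPE:-}" != "pmix" ] && ! ls -d /tmp/spmix_appdir_* >/dev/null 2>&1; then
    exit 0
fi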
3) NVIDIA hook (98-nvidia.sh) fails — GPU cannot be used at all
When GPU hooks are enabled:
nvidia-container-cli: mount error: mount operation failed:
/enroot-tmp/data/pyxis_<job>/run/nvidia-persistenced/socket: operation not permitted
Causing container startup to fail:
pyxis: couldn't start container
spank_pyxis.so: task_init() failed
If the hook is disabled, containers start — but no GPU is usable:
/dev/nvidia* devices appear, BUT:
- nvidia-smi is not injected
- NVIDIA driver libraries (libcuda.so, libnvidia-ml.so, etc.) are not mounted
- torch.cuda.is_available() → False
Because NVCR images rely on NVIDIA Container Toolkit to inject host GPUs and driver stack,
disabling the hook makes the container functional but CPU-only.
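For completeness, this is the kind of check that shows the CPU-only state inside a container started with the hook disabled (a sketch; the torch check assumes a PyTorch-based NVCR image):
# Inside the container started via Pyxis with 98-nvidia.sh disabled:
ls /dev/nvidia*                                   # device nodes are present
command -v nvidia-smi || echo "nvidia-smi missing"
ldconfig -p | grep -E 'libcuda|libnvidia-ml' || echo "driver libraries not mounted"
python3 -c 'import torch; print(torch.cuda.is_available())'   # prints False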
Steps to Reproduce
Environment
- Slinky-managed Slurm in Kubernetes
- Pyxis + Enroot 4.0.1
- NVIDIA H100 nodes (NVRM 580.95.05, CUDA 13.0)
- GPU plugin + NVIDIA toolkit installed
- ENROOT directories on local NVMe (/enroot-tmp)
- --mpi=pmix tested and verified
(This setup uses my custom Docker images, built from login-pyxis:25.11-ubuntu24.04 and slurmd-pyxis:25.11-ubuntu24.04.)
1) Whiteout failure
srun --container-image <multi-layer-image> bash
2) PMIx hook failure
srun --container-image <any-image> bash # without --mpi=pmix
3) NVIDIA hook failure
srun --gpus 2 --mpi=pmix --container-image <nvcr-image> bash
Expected Behavior
- Enroot should honor ENROOT_TEMP_PATH fully and avoid /tmp on tmpfs/overlayfs when extracting layers.
- The PMIx hook should only run when the job actually uses PMIx (--mpi=pmix, or detected via the environment).
- The NVIDIA hook should succeed in injecting /dev/nvidia* devices, nvidia-smi, the driver libraries (libcuda.so, etc.), and /run/nvidia-persistenced/socket without EPERM errors.
Overall expectation:
GPU-enabled containers should start correctly under Pyxis + Enroot without disabling core hooks.
Additional Context
A) tmpfs/overlayfs whiteout limitations
Pyxis still directs Enroot to use /tmp in some stages of import, even when ENROOT_TEMP_PATH is set.
On Kubernetes worker pods, /tmp is overlayfs → whiteouts fail → multi-layer images cannot be imported.
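One way to confirm this from the node is to trace an import and look for paths under /tmp/enroot.* (a sketch; the syscall filter is illustrative):
# If ENROOT_TEMP_PATH were honored everywhere, no /tmp/enroot.* paths should appear here.
ENROOT_TEMP_PATH=/enroot-tmp/tmp \
  strace -f -e trace=mkdir,mkdirat,openat,mknodat \
  enroot import docker://<multi-layer-image> 2>&1 | grep -o '/tmp/enroot\.[^"]*' | sort -u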
B) PMIx directory creation behavior
Slurm only generates PMIx directories for jobs using --mpi=pmix.
Hook should skip PMIx mounts for non-MPI jobs to avoid failures.
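This is easy to demonstrate from the login node (a sketch, no container involved):
# Without --mpi=pmix Slurm creates no PMIx step directories, so the hook has nothing to mount:
srun bash -c 'ls -d /tmp/spmix_appdir_* 2>/dev/null || echo "no PMIx dirs"'
# With --mpi=pmix the directories exist and the hook's bind-mounts succeed:
srun --mpi=pmix bash -c 'ls -d /tmp/spmix_appdir_*'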
C) NVIDIA hook mount EPERM
Pyxis runs Enroot inside a user namespace; nvidia-container-cli configure requires privileged bind-mounts into the Enroot rootfs.
This fails with EPERM even when the Slurm worker pod is privileged: true.
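To rule out missing privileges on the pod itself, a few checks run on the worker pod (a sketch; outside any container):
grep CapEff /proc/self/status              # pod-level effective capabilities
ls -l /run/nvidia-persistenced/socket      # the socket the hook tries to bind-mount
findmnt -T /enroot-tmp -no FSTYPE,OPTIONS  # filesystem backing the Enroot rootfs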
D) GPU works only if hook is disabled
Disabling 98-nvidia.sh allows containers to start (including multi-layer NVCR images),
but GPU libraries are missing → CUDA unavailable inside container.
Optional Suggestion: Support OCI backend (rootless docker/podman)
Optional — not required for resolving the bug.
If an OCI backend using rootless Docker or Podman were supported (via oci.conf + AuthType=auth/slurm), several of these issues might be avoided. According to the Slurm documentation, AuthType=auth/munge only supports rootless Docker or Podman.
- OCI runtimes already handle:
- driver injection
- mount permissions
- whiteout semantics
- Would reduce the need for Pyxis to perform privileged mount operations inside user namespaces
- Could avoid tmpfs whiteout problems entirely
- Might eliminate the need for custom Enroot NVIDIA hooks
Not required, but may offer a clean long-term architectural path.
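If Slurm's OCI backend with rootless Podman were used, GPU access could go through CDI instead of the Enroot NVIDIA hook; a sketch of what that looks like outside Slurm (the nvidia-ctk/CDI setup is an assumption about the node, not something Pyxis provides):
# Generate a CDI spec for the host GPUs with the NVIDIA Container Toolkit:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Rootless Podman then handles driver injection, mount permissions and whiteouts itself:
podman run --rm --device nvidia.com/gpu=all <nvcr-image> nvidia-smi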
If more logs (Pyxis debug, Enroot mount trace, NVIDIA Toolkit debug) would be helpful,
I can provide them.