Describe the bug
Sometimes Nexus does not catch Slurm job failures. This results in the Nexus process running forever instead of exiting with clearly identified failed workflow steps. Noticed when a Slurm job on baseline for a QMCPACK optimization failed: QMCPACK mysteriously crashed and the Slurm job exited, but Nexus did not register the failure.
After initial conversations with Jaron, this appears to be because Nexus checks for the presence of output files rather than whether the job still exists in the Slurm queue. This was historically a defensive measure, but I contend that these days even remote filesystems are fast and reliable enough that we can rely directly on the Slurm job status.
I suspect the same issue occurs with "workstation" runs if a running process is killed early enough that expected output is not present.
To Reproduce
I have a reproducible environment and jobs on baseline. Killing an optimization job very early, e.g. during the first configuration generation phase, will likely trigger the problem.
Expected behavior
The Nexus process should terminate once all eligible jobs have been attempted and are no longer queued, with failed workflow steps clearly identified.
System:
baseline at ORNL
Additional context
Issue created after initial discussions.