Describe the bug
Sometimes Nexus does not catch Slurm job failures. This results in the Nexus process running forever instead of exiting with clearly identified failed workflow steps. Noticed when a Slurm job on baseline for a QMCPACK optimization failed: QMCPACK mysteriously crashed and the Slurm job exited, but Nexus did not register the failure.
After initial conversations with Jaron, this appears to be because Nexus checks for the presence of output files rather than whether the job still exists in the Slurm queue. This was historically a defensive measure, but I contend that these days even remote filesystems are fast and reliable enough that we can rely directly on the Slurm job status.
I suspect the same issue occurs with "workstation" runs if a running process is killed early enough that expected output is not present.
To Reproduce
I have a reproducible environment and jobs on baseline. Killing an optimization job very early, e.g. during the first configuration generation phase, will likely trigger the problem.
Expected behavior
The Nexus process should terminate once all eligible jobs have been attempted and are no longer queued, with failed workflow steps clearly identified.
System:
baseline at ORNL
Additional context
Issue created after initial discussions.