Job fails with failed 100 : assumedly after job and qmaster logs can't read usage file when piping confirmation #68

@eddiewang927

Description

We observe that jobs submitted via our front-end wrapper script sometimes end with failed 100 : assumedly after job, and the qmaster message file reports it cannot read the usage file for the job. Importantly, the job is never executed on the execution host.

qmaster message


2025-12-11 16:21:27.352581|      worker|03|rdocs01|W|job 142486 .1 failed on host rd2696   assumedly after job because: can't read usage file for job 142486 .1

Client-side messages


waiting for interactive job to be scheduled ...

Your interactive job 142486 has been successfully scheduled.

Establishing builtin session to host rd2696 ...

Your job 142486 ("test_job") has been submitted

qacct -j 142486


start_time                         -/-
end_time                           -/-
granted_pe                         NONE
slots                              1
failed                             100 : assumedly after job
exit_status                        0
ru_wallclock                       0
ru_utime                           0.000
ru_stime                           0.000
ru_maxrss                          0
ru_ixrss                           0
ru_ismrss                          0
ru_idrss                           0
ru_isrss                           0
ru_minflt                          0
ru_majflt                          0
ru_nswap                           0
ru_inblock                         0
ru_oublock                         0
ru_msgsnd                          0
ru_msgrcv                          0
ru_nsignals                        0
ru_nvcsw                           0
ru_nivcsw                          0
wallclock                          0.000
cpu                                0.000
mem                                0.000
io                                 0.000
iow                                0.000
maxvmem                            0
maxrss                             0
arid                               undefined

Environment

  • Product: OCS 9.0.9 (build 141125-1311) — official binaries

  • OS/Distro: Oracle Linux 8.10

  • Front-end: wrapper submit script (run_job) that asks for yes/no confirmation before invoking qrsh and related tools

Observations

  • The job never starts on the execution host (no start_time, no resource usage).

  • Accounting shows zero usage and failed=100.

  • The qmaster log reports that the usage file for the job cannot be read, which suggests the shepherd never wrote it.

  • The issue occurs only when confirmation is piped (e.g., echo y | run_job or y | run_job).

  • When typing y interactively, the job runs normally.

Steps to Reproduce

  1. Use the submit wrapper script that prompts for confirmation.

  2. Pipe y into the script:

    echo y | run_job
    
    # or
    
    y | run_job
    
  3. Observe that the job is scheduled but never executed, and accounting shows failed 100.

Control Case (works)

  • When typing y manually at the prompt (interactive stdin), the job runs normally and accounting is written.
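
To help narrow this down without our real wrapper, here is a minimal sketch of what a confirm-then-submit script might look like. It is not the actual run_job script: the names stdin_kind, confirm and run_job_sketch are hypothetical and exist only for illustration. It shows one difference between the failing and working cases that may be worth checking: whatever command the wrapper starts after the prompt inherits the same stdin, which is a pipe that is already at end-of-file in the piped case and a terminal in the interactive case. Whether this is related to the failure is an assumption, not something we have confirmed.

```shell
#!/bin/sh
# Hypothetical stand-in for the run_job wrapper (the real script is not shown here).

# Report whether the current stdin is a terminal.
stdin_kind() {
    if [ -t 0 ]; then
        echo "tty"
    else
        echo "not-a-tty"
    fi
}

# Ask for y/n confirmation on stdin; succeeds only when the answer is "y".
confirm() {
    printf 'Submit job? [y/n] ' >&2
    IFS= read -r answer || return 1
    [ "$answer" = "y" ]
}

# Confirm, log the stdin kind to stderr, then run the given command.
# The command inherits the same stdin the confirmation was read from.
run_job_sketch() {
    confirm || { echo "aborted" >&2; return 1; }
    echo "stdin for submit command: $(stdin_kind)" >&2
    "$@"
}
```

For example, echo y | run_job_sketch cat prints nothing on stdout, because the only line in the pipe was consumed by the prompt and cat sees end-of-file immediately. Running run_job_sketch qrsh hostname once with piped input and once with typed input would show on stderr which kind of stdin qrsh is started with in each case.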

Thanks in advance!
