psutil 5.9.6 seems to be throwing ZombieProcess when retrieving the mms process #132

@charlietruong-wk

Describe the bug
We use a custom image for our SageMaker endpoint, and on Friday, Oct 20, 2023, the endpoint became unstable after a re-deploy. It seems that psutil 5.9.6, the latest version at the time, raises ZombieProcess more frequently than 5.9.5, which causes the model server to restart. As a result, the endpoint occasionally returns non-200 responses to prediction requests.

The cause may be this fix in psutil, which appears to have changed what it recognizes as a zombie process and when it raises ZombieProcess:
giampaolo/psutil#2288

We resolved the issue on our side by rolling back to psutil 5.9.5. I'm unsure whether sagemaker-inference should pin the psutil version in the package or whether the fix belongs here:

https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276
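For illustration, here is a minimal sketch of how the process lookup could tolerate zombie processes instead of failing on them. It is not the toolkit's actual code: the helper names `find_by_cmdline_arg` and `find_mms_processes` are hypothetical, and the value of `MMS_NAMESPACE` is an assumption. The only psutil names used are `process_iter()`, `Process.cmdline()`, and the exception classes `ZombieProcess`, `NoSuchProcess`, and `AccessDenied`.

```python
# Assumed value; the real constant lives in sagemaker_inference/model_server.py.
MMS_NAMESPACE = "com.amazonaws.ml.mms.ModelServer"


def find_by_cmdline_arg(processes, arg, ignored_errors):
    """Return the processes whose command line contains `arg`.

    Any process whose cmdline() raises one of `ignored_errors` is skipped,
    so a single unreadable process does not abort the whole scan.
    """
    matches = []
    for process in processes:
        try:
            cmdline = process.cmdline()
        except ignored_errors:
            # e.g. a zombie or already-exited process; it cannot be the server
            continue
        if arg in cmdline:
            matches.append(process)
    return matches


def find_mms_processes():
    """Hypothetical wiring of the helper above to psutil."""
    import psutil  # imported here so the helper itself has no hard dependency

    ignored = (psutil.ZombieProcess, psutil.NoSuchProcess, psutil.AccessDenied)
    return find_by_cmdline_arg(psutil.process_iter(), MMS_NAMESPACE, ignored)
```

With something along these lines, an unrelated zombie process on the host would be skipped during the scan rather than propagating out of the lookup, and the existing "not found" and "multiple servers" checks could still run on the returned list.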

To reproduce
Create a custom SageMaker endpoint image with psutil 5.9.6 installed and deploy it.

Expected behavior
The model endpoint is stable and consistently returns successful predictions, and the ZombieProcess exception is not raised frequently.

Screenshots or logs
Here is a traceback we are seeing:

  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 99, in start_model_server
    mms_process = _retry_retrieve_mms_server_process(env.startup_timeout)
  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 199, in _retry_retrieve_mms_server_process
    return retrieve_mms_server_process()
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 206, in _retrieve_mms_server_process
    if MMS_NAMESPACE in process.cmdline():
  File "/usr/local/lib64/python3.8/site-packages/psutil/__init__.py", line 702, in cmdline
    return self._proc.cmdline()
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1650, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1788, in cmdline
    self._raise_if_zombie()
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1693, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)

System information

  • sagemaker inference version 1.5.11
  • custom docker image based on amazon linux 2
    • framework name: scikit-learn
    • framework version: 1.0.2
    • Python version: 3.8
    • processing unit type: cpu

Additional context
n/a
