Race Condition in Workflow State Management During Startup #753

@drockparashar

Description

There is a race condition between workflow state saving and retrieval during workflow startup that causes intermittent "not found" errors when activities try to access workflow configuration from the StateStore.

When a workflow is initiated using TemporalClient.start_workflow(), the following steps take place:

  1. The workflow configuration is saved to the StateStore through StateStore.save_state_object().

  2. The workflow starts immediately.

  3. During the first activity of the workflow, get_workflow_args() is called to retrieve the configuration via StateStore.get_state().

  4. Race condition: the configuration may not be available yet, which causes "not found" errors.
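The sequence above can be reproduced in isolation with a small simulation. This is only a sketch: `InMemoryStateStore` and `start_workflow` below are hypothetical stand-ins, not the SDK's `StateStore` or `TemporalClient`, and the background upload task is an assumption about where the timing gap comes from.

```python
import asyncio


class InMemoryStateStore:
    """Hypothetical stand-in for a state store; NOT the SDK's StateStore."""

    def __init__(self, upload_delay: float) -> None:
        self._objects: dict[str, dict] = {}
        self._upload_delay = upload_delay
        self.pending: list[asyncio.Task] = []

    async def _upload(self, key: str, value: dict) -> None:
        # Simulated object store upload that takes some time to land.
        await asyncio.sleep(self._upload_delay)
        self._objects[key] = value

    async def save_state_object(self, key: str, value: dict) -> None:
        # Assumed bug: the upload is scheduled but not awaited,
        # so this method returns before the object is readable.
        self.pending.append(asyncio.create_task(self._upload(key, value)))

    async def get_state(self, key: str) -> dict:
        if key not in self._objects:
            raise KeyError(f"No state found for {key}")
        return self._objects[key]


async def start_workflow(store: InMemoryStateStore, workflow_id: str, config: dict) -> str:
    await store.save_state_object(workflow_id, config)  # step 1
    try:
        # Steps 2 and 3: the first activity reads the config right away.
        await store.get_state(workflow_id)
        return "ok"
    except KeyError:
        return "not found"  # step 4: the race
    finally:
        # Let the simulated upload finish so no task is left dangling.
        await asyncio.gather(*store.pending)


result = asyncio.run(start_workflow(InMemoryStateStore(0.05), "wf-1", {"a": 1}))
print(result)
```

In this simulation the read always loses the race because the upload task has not run yet when `get_state()` is called. In the real system the outcome depends on timing, which matches the intermittent behavior described here.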

Root Cause Analysis

File: application_sdk\clients\temporal.py (lines 262-280)
The workflow startup sequence has a timing gap:

Image

File: application_sdk\services\statestore.py (lines 243-267)
The save_state_object() method has asynchronous operations:

Image

The get_state() method fails when the object store upload has not completed yet:

Image

Impact:

  • Intermittent workflow failures during startup.
  • Timing-dependent behavior that is difficult to reproduce consistently.
  • Users are compelled to implement workarounds with retry logic in their activities.

My current workaround is to implement retry logic in my get_workflow_args() override:

Image
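For readers who cannot see the screenshot, a retry wrapper of this kind might look roughly like the sketch below. It is not the code from the image: `get_with_retry` and `flaky_fetch` are hypothetical names, and treating `KeyError` as the "not found" signal is an assumption, since the exception type raised by the SDK is not stated here.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def get_with_retry(
    fetch: Callable[[], Awaitable[T]],
    attempts: int = 5,
    base_delay: float = 0.05,
) -> T:
    """Call fetch() until it succeeds, backing off exponentially between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await fetch()
        except KeyError:  # assumed "not found" signal
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            await asyncio.sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")


calls = 0


async def flaky_fetch() -> dict:
    """Hypothetical read that fails twice before the state becomes visible."""
    global calls
    calls += 1
    if calls < 3:
        raise KeyError("No state found")
    return {"connection": "example"}


config = asyncio.run(get_with_retry(flaky_fetch, base_delay=0.01))
print(config, calls)
```

Inside a get_workflow_args() override, `fetch` would wrap the state lookup. This only hides the race; it does not remove it, so a fix in the SDK is still preferable.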

Reproduction Repo/Script (if any)

https://github.com/drockparashar/githubConnector

Reproduction Steps

1. Import SDK
2. Call method `...`
3. Observe error `...`

Logs / Tracebacks

Expected vs Actual

Expected Behavior: When a workflow is initiated using TemporalClient.start_workflow(), the workflow configuration should be available immediately to activities that invoke get_workflow_args(). The expected sequence of events is as follows:

  1. StateStore.save_state_object() saves the workflow configuration to the object store.
  2. The method returns successfully, indicating that the state has been persisted and is available.
  3. The Temporal workflow starts and calls get_workflow_args().
  4. StateStore.get_state() successfully retrieves the configuration.
  5. The workflow proceeds without any errors.
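One way to get this ordering is for the save call to return only after the object is readable, and for the workflow to start only after the save call returns. The sketch below illustrates that idea with hypothetical names (`AwaitingStateStore`, `start_workflow`, `first_activity`); it is not the SDK's implementation or a proposed patch.

```python
import asyncio


class AwaitingStateStore:
    """Hypothetical store whose save returns only once the object is readable."""

    def __init__(self, upload_delay: float) -> None:
        self._objects: dict[str, dict] = {}
        self._upload_delay = upload_delay

    async def save_state_object(self, key: str, value: dict) -> None:
        # The simulated upload is awaited, not fired in the background.
        await asyncio.sleep(self._upload_delay)
        self._objects[key] = value
        # Read-after-write check before reporting success.
        if self._objects.get(key) != value:
            raise RuntimeError(f"State for {key} was not persisted")

    async def get_state(self, key: str) -> dict:
        if key not in self._objects:
            raise KeyError(f"No state found for {key}")
        return self._objects[key]


async def first_activity(store: AwaitingStateStore, workflow_id: str) -> dict:
    # Stands in for get_workflow_args() in the first activity.
    return await store.get_state(workflow_id)


async def start_workflow(store: AwaitingStateStore, workflow_id: str, config: dict) -> dict:
    await store.save_state_object(workflow_id, config)  # steps 1 and 2
    return await first_activity(store, workflow_id)     # steps 3 to 5


expected = {"a": 1}
loaded = asyncio.run(start_workflow(AwaitingStateStore(0.05), "wf-1", expected))
print(loaded)
```

The design point is that persistence is confirmed before the workflow is started, so the first read cannot observe a missing object.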

Actual Behavior: The workflow configuration is intermittently unavailable immediately after being saved, resulting in "not found" errors:

  1. StateStore.save_state_object() appears to complete successfully.
  2. The Temporal workflow starts right away and invokes get_workflow_args().
  3. StateStore.get_state() fails with an "object not found" error at lines 113-117 in statestore.py.
  4. As a result, the workflow fails to start and generates a "No state found" message.

Environment

None

SDK Version

0.1.1rc46

Metadata

Labels

bug
