Skip to content

[bugfix]: add presuspend sanity check and fix a bug in retry#275

Merged
zhongkechen merged 5 commits intomainfrom
checks
Apr 2, 2026
Merged

[bugfix]: add presuspend sanity check and fix a bug in retry#275
zhongkechen merged 5 commits intomainfrom
checks

Conversation

@zhongkechen
Copy link
Copy Markdown
Contributor

@zhongkechen zhongkechen commented Mar 26, 2026

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Issue Link, if available

Fixes #291

Description

Just before go suspend, we do sanity check to make sure:

  • it's not suspended more than once
  • there is at least one operation pending completion, otherwise, the backend will throw an exception

This change is to help us detect possible issues with suspend in our test cases, not to prevent them.

bug fix in runUserHandler

Found a bug in runUserHandler with the additional check.

The future was completed but the step was not retried because the following check in runUserHandler failed.

        if (runningUserHandler.get() != null) {
            throw new IllegalStateException("User handler already running");
        }

For a failed operation that is polling in the background, the runningUserHandler is already set. The exception thrown from here was swallowed because it's chained to a future.

The fix is not only checking if it's null, but also check if it's done and the execution will be terminated if it's not null and not done, meaning a user function is still running.

Demo/Screenshots

Checklist

  • I have filled out every section of the PR template
  • I have thoroughly tested this change

Testing

Unit Tests

Have unit tests been written for these changes?

Integration Tests

Have integration tests been written for these changes?

Examples

Has a new example been added for the change? (if applicable)

@zhongkechen
Copy link
Copy Markdown
Contributor Author

With this change we found a bug in Step operation: #295

@zhongkechen zhongkechen marked this pull request as ready for review March 31, 2026 23:56
@zhongkechen zhongkechen requested a review from a team March 31, 2026 23:56
@zhongkechen
Copy link
Copy Markdown
Contributor Author

zhongkechen commented Apr 1, 2026

The retry example still failing: https://github.com/aws/aws-durable-execution-sdk-java/actions/runs/23824927742/job/69445906795

The step operation failed at 45.429 second

{
    "@timestamp": "2026-04-02T16:40:45.429Z",
    "ecs.version": "1.2.0",
    "log.level": "WARN",
    "message": "Async operation failing on attempt 1",
    "process.thread.name": "durable-exec-16",
    "log.logger": "software.amazon.lambda.durable.examples.step.RetryInProcessExample"
}

the polling request properly made afte ~1 second

{
    "@timestamp": "2026-04-02T16:40:46.848Z",
    "ecs.version": "1.2.0",
    "log.level": "DEBUG",
    "message": "Calling durable checkpoint API with 0 updates: []",
    "process.thread.name": "durable-sdk-internal-1",
    "log.logger": "software.amazon.lambda.durable.execution.CheckpointManager"
}

And the response contains the READY operation

NewExecutionState=CheckpointUpdatedExecutionState(Operations=[Operation(Id=6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b, Name=flaky-async-operation, Type=STEP, StartTimestamp=2026-04-02T16:40:45.700Z, Status=READY, StepDetails=StepDetails(Attempt=1, Error=ErrorObject(ErrorMessage=*** Sensitive Data Redacted ***, ErrorType=*** Sensitive Data Redacted ***, ErrorData=*** Sensitive Data Redacted ***, StackTrace=*** Sensitive Data Redacted ***)))])).

The poller future was completed

{
    "@timestamp": "2026-04-02T16:40:46.904Z",
    "ecs.version": "1.2.0",
    "log.level": "DEBUG",
    "message": "1 operations processed and 1 pollers completed (latency=625002ns). ",
    "process.thread.name": "durable-sdk-internal-1",
    "log.logger": "software.amazon.lambda.durable.execution.CheckpointManager"
}

The future was completed but the step was not retried because the following check in runUserHandler failed.

        if (runningUserHandler.get() != null) {
            throw new IllegalStateException("User handler already running");
        }

For a failed operation that is polling in the background, the runningUserHandler is already set. The exception thrown from here was swallowed because it's chained to a future.

@zhongkechen zhongkechen added the bug Something isn't working label Apr 2, 2026
@zhongkechen zhongkechen changed the title add presuspend sanity check [bugfix]: add presuspend sanity check and fix a bug in retry Apr 2, 2026
Copy link
Copy Markdown
Contributor

@wangyb-A wangyb-A left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a test case for this bug fix?

@zhongkechen zhongkechen merged commit 4ee9ae8 into main Apr 2, 2026
11 of 14 checks passed
@zhongkechen zhongkechen deleted the checks branch April 2, 2026 20:09
@zhongkechen
Copy link
Copy Markdown
Contributor Author

Can we have a test case for this bug fix?

The new validation logic added in this PR caused existing cases to fail. No extra test case required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Add pre-suspend validation

2 participants