feat: add retry utility with exponential backoff#17

Open
bnusunny wants to merge 32 commits into develop from test/microvm-pr-review-v2

Conversation

@bnusunny (Owner)

Adds a reusable retry decorator with exponential backoff and jitter for handling transient failures.

bnusunny and others added 30 commits February 11, 2026 17:07
Add a manually-triggered GitHub Actions workflow that runs
test_download_two_layers 5 times with debug instrumentation
to diagnose why layers sometimes return 'Layer1' instead of
'Layer2' on Docker (but not Finch).

Instrumentation logs:
- Layer ordering in download_all() input/output
- Generated Dockerfile ADD command sequence
- Tarball construction order
- Image build vs reuse decisions
Instead of running isolated iterations, run the exact same pytest
command as the CI local-invoke suite with added debug logging.
This reproduces the real test ordering and cross-test interactions.

Enhanced instrumentation now also captures:
- Per-layer download decisions (cached vs fresh)
- Extracted layer file contents after download
- Docker build stream output (cache hits show as CACHED)
Run just the TestLayerVersion class instead of the full local-invoke
suite. This covers the failing test_download_two_layers plus the
other tests in the class that may cause cross-test contamination.
debug: add workflow to investigate flaky layer ordering tests
debug: remove repo guard for fork
…rm container tests

* Update notify-slack.yml to add sleep (aws#8646)

* fix: use tag prefix matching to clean up samcli/lambda-* images (aws#8647)

Docker's images.list(name='samcli/lambda') does exact repository matching
and won't match repositories like 'samcli/lambda-python'. This caused
stale images to persist across parameterized test classes, leading to
flaky test_download_two_layers failures where Layer2 should overwrite
Layer1 but the image was never rebuilt.

Fix by:
1. Adding _cleanup_samcli_images() that lists all images and filters by
   'samcli/lambda-' tag prefix
2. Using this method in both tearDown and tearDownClass
3. Fixing the same pattern in TestLayerVersionThatDoNotCreateCache
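
The exact-match vs prefix-match difference can be sketched in a few lines (a hypothetical helper over plain tag strings, not the actual docker-py-based `_cleanup_samcli_images()`):

```python
# Hypothetical sketch: docker-py's images.list(name='samcli/lambda') does
# exact repository matching, so 'samcli/lambda-python' is missed. Filtering
# all tags by prefix catches every parameterized runtime image.
def filter_samcli_images(image_tags, prefix="samcli/lambda-"):
    """Return tags that match the SAM CLI lambda image prefix."""
    return [tag for tag in image_tags if tag.startswith(prefix)]

tags = [
    "samcli/lambda-python:3.12",
    "samcli/lambda-nodejs:20",
    "samcli/lambda:latest",  # exact-name matching would only find this repo
    "ubuntu:22.04",
]
stale = filter_samcli_images(tags)
```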

* fix: remove fallback in count_running_containers that causes flaky warm container tests

The count_running_containers method had a fallback that returned the count
of ALL SAM CLI containers when MODE env var filtering found no matches.
This caused AssertionError: 3 != 2 when stale containers from other tests
were present. Now it strictly counts only containers matching this test's
unique MODE UUID and uses exact string matching.
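
A toy version of the strict counting (container records reduced to plain env dicts; names are illustrative, not the real method):

```python
import uuid

def count_matching_containers(container_envs, mode_uuid):
    """Count only containers whose MODE env var exactly equals this
    test's unique UUID; no fallback to counting everything."""
    return sum(1 for env in container_envs if env.get("MODE") == mode_uuid)

my_mode = str(uuid.uuid4())
running = [
    {"MODE": my_mode},            # this test's first warm container
    {"MODE": my_mode},            # this test's second warm container
    {"MODE": str(uuid.uuid4())},  # stale container from another test
]
count = count_matching_containers(running, my_mode)
```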

* fix: add AWS_DEFAULT_REGION and use -k pattern for parameterized tests

* fix: add BY_CANARY=true to enable Docker tests on CI

* fix: exclude RemoteLayers tests that need AWS credentials

---------

Co-authored-by: seshubaws <116689586+seshubaws@users.noreply.github.com>
When build_in_source is used with Node.js, the build directory contains
a node_modules symlink. During local invoke, SAM CLI resolves this symlink
and creates an additional bind mount. Docker tolerates creating a mountpoint
over a symlink, but Finch (containerd/runc) fails with 'not a directory'.

This fix temporarily replaces symlinks with empty directories before
container creation, then restores them afterward. This ensures:
- Finch/runc gets a valid directory mountpoint
- Docker continues to work as before
- Repeated invocations work because symlinks are restored
- Host filesystem is left unchanged even if container creation fails
fix: replace symlinks with dirs for Finch container mount compatibility
fix: restore BY_CANARY to run docker tests on CI
Containerd/Finch resolves bind mounts at start time, not create time.
Moving the symlink restore to after start() ensures the empty directory
mountpoints are still present when the container runtime sets up mounts.
fix: move symlink restore to after container start
Bind mounts must remain valid for the container's entire lifetime.
Both Docker and Finch need the empty directory mountpoint to persist
until the container is stopped and deleted. Restoring symlinks in the
delete() method's finally block ensures proper cleanup alongside
the existing host_tmp_dir cleanup.
fix: restore symlinks at container delete time
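
The swap-and-restore lifecycle described in these commits can be sketched as follows (a standalone illustration, not SAM CLI's container code):

```python
import os
import tempfile

def replace_symlink_with_dir(path):
    """If path is a symlink, swap it for an empty directory and return
    the original target so it can be restored at container delete time."""
    if os.path.islink(path):
        target = os.readlink(path)
        os.unlink(path)
        os.mkdir(path)
        return target
    return None

def restore_symlink(path, target):
    """Undo the swap: drop the empty directory, recreate the symlink."""
    if target is not None:
        os.rmdir(path)
        os.symlink(target, path)

# Demo in a throwaway directory standing in for a Node.js build dir.
root = tempfile.mkdtemp()
real = os.path.join(root, "real_node_modules")
link = os.path.join(root, "node_modules")
os.mkdir(real)
os.symlink(real, link)

saved = replace_symlink_with_dir(link)  # before container create
is_plain_dir = os.path.isdir(link) and not os.path.islink(link)
restore_symlink(link, saved)            # in delete()'s finally block
```

Because the swap happens before create and the restore happens in delete(), the directory mountpoint stays valid for the container's whole lifetime, as the commit messages require.
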
Implement a local CloudFormation Language Extensions processor supporting:
- Fn::ForEach loop expansion in Resources, Conditions, and Outputs
- Fn::Length, Fn::ToJsonString intrinsic functions
- Fn::FindInMap with DefaultValue support
- Conditional DeletionPolicy/UpdateReplacePolicy
- Nested ForEach depth validation (max 5 levels)
- Partial resolution mode preserving unresolvable references

Pipeline architecture: TemplateParsingProcessor -> ForEachProcessor ->
IntrinsicResolverProcessor -> DeletionPolicyProcessor ->
UpdateReplacePolicyProcessor

Includes comprehensive unit tests and CloudFormation compatibility suite.
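
As a toy illustration of the Fn::ForEach expansion step (greatly simplified; the real processor also handles Conditions, Outputs, intrinsics, partial resolution, and nesting up to 5 levels):

```python
def substitute(node, ident, value):
    """Replace ${ident} placeholders throughout a template fragment.
    Plain string replacement keeps the sketch small; real templates may
    also need Fn::Sub and the other intrinsics listed above."""
    placeholder = "${%s}" % ident
    if isinstance(node, str):
        return node.replace(placeholder, value)
    if isinstance(node, dict):
        return {substitute(k, ident, value): substitute(v, ident, value)
                for k, v in node.items()}
    if isinstance(node, list):
        return [substitute(item, ident, value) for item in node]
    return node

def expand_foreach(resources):
    """Expand top-level Fn::ForEach entries in a Resources section."""
    expanded = {}
    for key, value in resources.items():
        if key.startswith("Fn::ForEach::"):
            ident, collection, fragment = value
            for item in collection:
                for logical_id, body in fragment.items():
                    expanded[substitute(logical_id, ident, item)] = \
                        substitute(body, ident, item)
        else:
            expanded[key] = value
    return expanded

resources = {
    "Fn::ForEach::Topics": [
        "TopicName",
        ["Alpha", "Beta"],
        {"Sns${TopicName}": {
            "Type": "AWS::SNS::Topic",
            "Properties": {"TopicName": "topic-${TopicName}"},
        }},
    ]
}
expanded = expand_foreach(resources)
```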
Wire the language extensions library into SAM CLI with two-phase architecture:
- Phase 1: expand_language_extensions() -> LanguageExtensionResult
- Phase 2: SamTranslatorWrapper.run_plugins() (SAM transform only)

Key components:
- expand_language_extensions() canonical entry point with template-level
  cache keyed on (path, mtime, params_hash)
- SamTranslatorWrapper receives pre-expanded template (Phase 2 only)
- SamLocalStackProvider.get_stacks() calls expand_language_extensions()
- SamTemplateValidator calls expand_language_extensions()
- DynamicArtifactProperty dataclass for Mappings transformation
- Fn::ForEach guards in artifact_exporter, normalizer, cdk/utils
- clear_expansion_cache() for warm container file change events
- _get_template_for_output() preserves Fn::ForEach in build output
- _update_foreach_artifact_paths() generates Mappings for dynamic
  artifact properties with per-function build paths
- Recursive nested Fn::ForEach support
- ForEach-aware path resolution skips Docker image URIs

Test templates: static CodeUri, dynamic CodeUri, parameter collections,
nested stacks, nested ForEach, dynamic ImageUri, depth validation.
Package:
- _export() calls expand_language_extensions() for Phase 1
- Preserves Fn::ForEach in packaged template with S3 URIs
- Generates Mappings for dynamic artifact properties
- _find_artifact_uri_for_resource() handles all export formats:
  string, {S3Bucket,S3Key}, {Bucket,Key}, {ImageUri}
- Recursive nested Fn::ForEach support
- Warning for parameter-based collections

Deploy:
- Uploads original unexpanded template to CloudFormation
- Clear error for missing Mapping keys

Integration tests for CodeUri, ContentUri, DefinitionUri, ImageUri,
BodyS3Location across all packageable resource types.
- sam validate: valid ForEach, invalid syntax, cloud-dependent collections,
  dynamic CodeUri, nested depth validation (5 valid, 6 invalid)
- sam local invoke: expanded function names from ForEach
- sam local start-api: ForEach-generated API endpoints
Add make test-lang-ext and make test-all targets so the 1695
language extensions unit tests only run when needed, keeping
the default make test fast for unrelated PRs.
The expand_language_extensions() cache stored references to template dicts
that were later mutated in-place by ApplicationBuilder.update_template()
(which changes nested stack Location properties to build-output paths).
On cache hit, the mutated dict was returned, causing TemplateNotFoundException
during the second infra sync in sam sync --watch.

Remove the cache entirely since deep-copying on hit negates the performance
benefit and adds complexity. Keep clear_expansion_cache() as a no-op for
backward compatibility.

Fixes TestSyncInfraNestedStacks_0 and TestSyncInfraNestedStacks_1 integration
test failures.
fix: remove expansion cache to fix sync watch nested stack failures
fix: use OIDC credentials directly for sync test workflow

@cortex-assistant (bot) left a comment


PR Review: feat: add retry utility with exponential backoff

Thanks for this refactor! The new retry_with_backoff implementation is a clear improvement over the original retry — adding jitter, max delay capping, proper logging, and a cleaner API. However, there are several issues that need to be addressed before this can be merged.


🔴 Breaking Change — Existing callers will break

This PR removes the retry function entirely and replaces it with retry_with_backoff, which has a completely different signature. The existing codebase has callers that import and use the old API, e.g.:

# samcli/local/docker/container.py
from samcli.lib.utils.retry import retry

The old retry() decorator takes exc as a positional arg, plus exc_raise and exc_raise_msg parameters — none of which exist in the new API. This PR will break all existing callers at runtime.

Recommendation: Either:

  1. Keep the old retry() function (deprecated) alongside the new retry_with_backoff() for backward compatibility, or
  2. Update all call sites in the same PR to use the new API.
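
Option 1 could look roughly like this sketch (the old parameter names follow the review's description of the old API; the retry counts, defaults, and the stand-in decorator body are guesses, not the PR's code):

```python
import functools
import warnings

# Stand-in for the new decorator so the shim below runs on its own;
# the real retry_with_backoff adds exponential backoff and jitter.
def retry_with_backoff(max_retries=3, base_delay=1.0, max_delay=30.0,
                       exceptions=(Exception,)):
    def wrapper(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries:
                        raise
        return inner
    return wrapper

def retry(exc, exc_raise=None, exc_raise_msg=None):
    """Deprecated shim keeping the old signature alive while delegating
    to retry_with_backoff. New code should use retry_with_backoff."""
    def wrapper(func):
        retried = retry_with_backoff(max_retries=2, base_delay=0.0,
                                     exceptions=(exc,))(func)

        @functools.wraps(func)
        def inner(*args, **kwargs):
            warnings.warn("retry() is deprecated; use retry_with_backoff()",
                          DeprecationWarning, stacklevel=2)
            try:
                return retried(*args, **kwargs)
            except exc:
                # Old contract: raise the caller-supplied exception type
                # once all attempts are exhausted.
                if exc_raise is not None:
                    raise exc_raise(exc_raise_msg)
                raise
        return inner
    return wrapper

calls = {"n": 0}

@retry(ValueError, exc_raise=RuntimeError, exc_raise_msg="gave up")
def flaky():
    calls["n"] += 1
    raise ValueError("transient")
```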

🔴 No tests included

This PR modifies core retry logic but includes no new or updated tests. The existing tests for the old retry function will also be broken. Please add unit tests covering:

  • Successful call on first attempt (no retry)
  • Successful call after N transient failures
  • All retries exhausted → exception re-raised
  • max_delay cap is respected
  • Only specified exceptions are caught (other exceptions propagate immediately)
  • Jitter produces non-deterministic delays (mock random.uniform)
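
As a sketch of the second case, with time.sleep mocked so the test is instant (the decorator here is a minimal reimplementation based on the fragments quoted in this review, not the PR's exact code):

```python
import random
import time
from unittest import mock

# Minimal reimplementation matching the fragments quoted in this review,
# included only so the test sketch below is self-contained.
def retry_with_backoff(max_retries=3, base_delay=1.0, max_delay=30.0,
                       exceptions=(Exception,)):
    def wrapper(func):
        def inner(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    last_exception = e
                    if attempt == max_retries:
                        break
                    delay = min(base_delay * (2 ** attempt)
                                + random.uniform(0, 1), max_delay)
                    time.sleep(delay)
            raise last_exception
        return inner
    return wrapper

# Test sketch: succeeds after two transient failures; mocking time.sleep
# keeps the test instant and lets the sleep count be inspected.
def test_succeeds_after_transient_failures():
    attempts = {"n": 0}

    @retry_with_backoff(max_retries=3, base_delay=1.0)
    def flaky():
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("transient")
        return "ok"

    with mock.patch("time.sleep") as fake_sleep:
        assert flaky() == "ok"
    assert attempts["n"] == 3
    assert fake_sleep.call_count == 2  # one sleep per failed attempt

test_succeeds_after_transient_failures()
```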

🟡 Behavioral difference in exception re-raising

The old function raised a custom exception (exc_raise(exc_raise_msg)) after retries were exhausted. The new function re-raises the last caught exception. This is arguably better behavior (preserves the original traceback), but it's a semantic change that callers may depend on. Please document this intentional change in the PR description.


🟡 Minor code-level comments below.

Overall the implementation logic is sound — the exponential backoff + jitter + max_delay pattern is well-established. Just needs the surrounding issues addressed.

"""

def retry_wrapper(func):
def retry_with_backoff(max_retries=3, base_delay=1.0, max_delay=30.0, exceptions=(Exception,)):

🟡 exceptions parameter type mismatch with except clause

The default exceptions=(Exception,) is a tuple, which works correctly with except. However, if a caller passes a single exception class (not wrapped in a tuple), e.g. retry_with_backoff(exceptions=ValueError), this will also work because except accepts both. That said, it would be good to add a type hint to make the expected input clear:

from typing import Type, Tuple, Union

def retry_with_backoff(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    exceptions: Union[Type[BaseException], Tuple[Type[BaseException], ...]] = (Exception,),
):


except exceptions as e:
    last_exception = e
    if attempt == max_retries:

🟡 Jitter is additive, not multiplicative. The current formula:

delay = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)

adds a fixed [0, 1) second jitter regardless of the backoff magnitude. For a very small base_delay (e.g., 0.01s) the jitter dominates; for large delays (e.g., 16s) it is negligible.

Consider using proportional jitter instead:

delay = min(base_delay * (2 ** attempt), max_delay)
delay = delay * random.uniform(0.5, 1.0)  # "decorrelated" or "equal" jitter

Or at minimum, document the current jitter strategy so callers understand the behavior.
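
The scale mismatch is easy to demonstrate numerically (helper names are illustrative; the seed just makes the run reproducible):

```python
import random

def additive_jitter(attempt, base_delay, max_delay, rng):
    # Current PR formula: fixed [0, 1) s of jitter on top of the backoff.
    return min(base_delay * (2 ** attempt) + rng.uniform(0, 1), max_delay)

def proportional_jitter(attempt, base_delay, max_delay, rng):
    # Suggested alternative: jitter scales with the backoff magnitude.
    capped = min(base_delay * (2 ** attempt), max_delay)
    return capped * rng.uniform(0.5, 1.0)

rng = random.Random(0)  # seeded so the comparison is reproducible
base = 0.01             # deliberately tiny base delay
additive = [additive_jitter(a, base, 30.0, rng) for a in range(4)]
proportional = [proportional_jitter(a, base, 30.0, rng) for a in range(4)]
```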

if attempt == max_retries:
    break
delay = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
LOG.warning("Retry %d/%d for %s: %s (%.1fs)", attempt + 1, max_retries, func.__name__, e, delay)

🟢 Nice addition of logging. This is a great improvement over the old silent retry. The log format is clear and actionable.

Minor nit: consider using LOG.info instead of LOG.warning for expected transient retries — warning level may be too noisy in production if retries are a normal/expected path. Alternatively, make the log level configurable.

delay = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
LOG.warning("Retry %d/%d for %s: %s (%.1fs)", attempt + 1, max_retries, func.__name__, e, delay)
time.sleep(delay)
raise last_exception

🔴 The trailing raise last_exception is fragile.

In Python 3, raise last_exception does re-raise the stored exception together with its __traceback__, but it stacks an extra frame at the re-raise site, and static analyzers flag last_exception as possibly unbound. Rather than patching it with raise ... from ..., the cleanest fix is a bare raise inside the except block:

except exceptions as e:
    last_exception = e
    if attempt == max_retries:
        raise  # preserves the original traceback

With this restructuring, the trailing raise last_exception (and the last_exception variable itself) can be dropped entirely.
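
Putting the review's suggestions together, one possible shape for the decorator (a hedged sketch, not the PR's actual code; it also applies the earlier LOG.info suggestion and the typed signature):

```python
import functools
import logging
import random
import time
from typing import Tuple, Type, Union

LOG = logging.getLogger(__name__)

def retry_with_backoff(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    exceptions: Union[Type[BaseException],
                      Tuple[Type[BaseException], ...]] = (Exception,),
):
    """Retry decorator sketch: exponential backoff with additive jitter,
    info-level retry logging, and a bare raise to keep the traceback."""
    def wrapper(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    if attempt == max_retries:
                        raise  # preserves the original traceback
                    delay = min(base_delay * (2 ** attempt)
                                + random.uniform(0, 1), max_delay)
                    LOG.info("Retry %d/%d for %s: %s (%.1fs)",
                             attempt + 1, max_retries, func.__name__, e, delay)
                    time.sleep(delay)
        return inner
    return wrapper
```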
