BLDX-469 feat(temporal): Temporal SDK initial connect does not retry on errors#960
BLDX-469 feat(temporal): Temporal SDK initial connect does not retry on errors#960anbarasantr wants to merge 1 commit intomainfrom
Conversation
📜 Docstring Coverage ReportRESULT: PASSED (minimum: 30.0%, actual: 77.3%) Detailed Coverage Report |
📦 Trivy Vulnerability Scan Results
Report SummaryCould not generate summary table (data length mismatch: 9 vs 8). Scan Result Detailsrequirements.txtuv.lock |
📦 Trivy Secret Scan Results
Report Summary
Scan Result Details✅ No secrets found during the scan for |
|
🛠 Docs available at: https://k.atlan.dev/application-sdk/claude/slack-app-sdk-authnz-update-nfzm4 |
☂️ Python Coverage
Overall Coverage
New FilesNo new covered files... Modified FilesNo covered modified files...
|
|
🛠 Full Test Coverage Report: https://k.atlan.dev/coverage/application-sdk/pr/960 |
d2f40bd to
1093e2f
Compare
…nnection The Temporal SDK initial connection was failing without retry on transient network errors, causing workflow failures during temporary network issues. This change adds: - Exponential backoff retry logic using tenacity library for Client.connect() - Configurable max_retries via environment variable: - ATLAN_WORKFLOW_CONNECTION_MAX_RETRIES (default: 5) - Runtime override via load(max_retries=N) parameter - Uses tenacity's wait_exponential (min=1s, max=60s) for backoff - Added tenacity as explicit dependency in workflows optional deps The retry configuration is fully optional - existing code will work without any changes, with sensible defaults (5 retries with exponential backoff). https://claude.ai/code/session_01QpXdtkkML7DM6EswP1EAiS
1093e2f to
89cc8c3
Compare
| await asyncio.sleep(60) # Wait 1 minute before retrying | ||
|
|
||
| async def load(self) -> None: | ||
| async def load(self, max_retries: int | None = None) -> None: |
There was a problem hiding this comment.
This seems unnecessarily verbose + complex (+ breaks Python < 3.10, apparently). Why not just:
async def load(self, max_retries: int = WORKFLOW_CONNECTION_MAX_RETRIES) -> None:
| max_retries = ( | ||
| max_retries if max_retries is not None else WORKFLOW_CONNECTION_MAX_RETRIES | ||
| ) |
There was a problem hiding this comment.
This can be dropped entirely if you go with the simpler defaulting above.
| "dapr==1.16.0", | ||
| "temporalio>=1.7.1,<1.22.0", | ||
| "orjson>=3.10.18,<3.12.0", | ||
| "tenacity>=8.0.0,<10.0.0", |
There was a problem hiding this comment.
I'm wary about adding a dependency just for this retry logic in 1 function... (Every dependency in Python is a chance for conflicts with other code or versions (transitive dependencies needed by other libs), aside from extending the surface area for CVEs.)
Is there something here we're strongly dependent on, or could we use something like httpx-retries (given we already use httpx)?
cc: @Aryamanz29 — I seem to remember we had a similar consideration on retries at some point with pyatlan if I'm not mistaken (?)
There was a problem hiding this comment.
Yes @cmgrote - but I think we need custom retries here specifically for the temporal server connection. To clarify the distinction:
httpx-retriesis for HTTP connections viahttpxtenacityis a general-purpose retry library for any function
also, tenacity is already a core dependency of pyatlan (and since pyatlan is a core dependency of apps-sdk), we don't need to add it again under [project.optional-dependencies]^^
| ) | ||
| @patch("application_sdk.clients.temporal.SecretStore.get_deployment_secret") | ||
| async def test_load_with_retry_success_after_failures( | ||
| mock_get_config: AsyncMock, |
There was a problem hiding this comment.
Tests actually sleep here, needs to be mocked.
📦 Example workflows test results
|
📦 Example workflows test results
|
📦 Example workflows test results
|
📦 Example workflows test results
|
Summary
Related