Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Jan 14, 2026

Issue describing the changes in this PR

Under concurrent load, multiple threads acquire the _hostSecretsLock serially, each performing redundant secret loading operations. This causes excessive log noise and potentially contributes to DataProtection library failures due to high call volume.

Changes:

  • Added double-check locking in GetHostSecretsAsync - After acquiring lock, verify _hostSecrets is still null before loading
  • Added double-check locking in GetFunctionSecretsAsync - After acquiring lock, verify function secrets aren't already cached before loading
  • Added concurrency tests - Validate "Loading secrets" log appears exactly once despite concurrent calls
  • Updated release notes - Added entry in release_notes.md describing the fix
// Before: All threads waiting on lock would load secrets
await _hostSecretsLock.WaitAsync();
try
{
    _logger.LogDebug("Loading host secrets");
    hostSecrets = await LoadSecretsAsync<HostSecrets>();
    // ...
}

// After: Only first thread loads; others return cached value
await _hostSecretsLock.WaitAsync();
try
{
    if (_hostSecrets is not null)
    {
        return _hostSecrets;
    }
    _logger.LogDebug("Loading host secrets");
    hostSecrets = await LoadSecretsAsync<HostSecrets>();
    // ...
}

Pull request checklist

IMPORTANT: Currently, changes must be backported to the in-proc branch to be included in Core Tools and non-Flex deployments.

  • Backporting to the in-proc branch is not required
    • Otherwise: Link to backporting PR
  • My changes do not require documentation changes
    • Otherwise: Documentation issue linked to PR
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • My changes do not require diagnostic events changes
    • Otherwise: I have added/updated all related diagnostic events and their documentation (Documentation issue linked to PR)
  • I have added all required tests (Unit tests, E2E tests)

Additional information

This addresses CRI #51000000793837 where 72 threads serially loaded secrets, each logging "Loading host secrets" and stressing the DataProtection library. The fix reduces both computational overhead and log volume while potentially resolving related DataProtection failures.

Original prompt

This section details on the original issue you should resolve

<issue_title>Race condition in SecretManager secret caching</issue_title>
<issue_description>In investigating CRI https://portal.microsofticm.com/imp/v5/incidents/details/51000000793837/summary, telemetry shows that the caching logic we have here for host secrets isn't working correctly under concurrent access. What can happen (and what we saw in the CRI) is that 72 threads can all come in and acquire the lock serially, meaning each thread will log the "Loading host secrets" message then one at a time go through the process to load the secrets from storage. The last one will win. We need to employ a double check strategy to prevent this:

public async virtual Task<HostSecretsInfo> GetHostSecretsAsync()
{
    if (_hostSecrets == null)
    {
        using (_metricsLogger.LatencyEvent(GetMetricEventName(MetricEventNames.SecretManagerGetHostSecrets)))
        {
            await _hostSecretsLock.WaitAsync();

            try
            {
                // FIX: Double-check after acquiring lock
                if (_hostSecrets is not null)
                {
                    return _hostSecrets;
                }

                _logger.LogDebug("Loading host secrets");

                HostSecrets hostSecrets = await LoadSecretsAsync<HostSecrets>();
                
                // ... rest of the method

This will prevent all the unnecessary work that is happening (and the log noise). Note that the same fix should also be applied to GetFunctionSecretsAsync.

In addition, I suspect that this will also address a DataProtection issue that also happened in this CRI. For that issue, the host is showing the following error and stack trace:

Non-decryptable host secrets detected. Refreshing secrets. Exception: System.Security.Cryptography.CryptographicException: The provided payload could not be decrypted. Refer to the inner exception for more information. For more information go to http://aka.ms/dataprotectionwarning
 ---> System.InvalidOperationException: The key ring does not contain a valid default protection key. The data protection system cannot create a new key because auto-generation of keys is disabled. For more information go to http://aka.ms/dataprotectionwarning
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingProvider.CreateCacheableKeyRingCore(DateTimeOffset now, IKey keyJustAdded)
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingProvider.Microsoft.AspNetCore.DataProtection.KeyManagement.Internal.ICacheableKeyRingProvider.GetCacheableKeyRing(DateTimeOffset now)
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingProvider.GetCurrentKeyRingCore(DateTime utcNow, Boolean forceRefresh)
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingProvider.GetCurrentKeyRing()
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingBasedDataProtector.UnprotectCore(Byte[] protectedData, Boolean allowOperationsOnRevokedKeys, UnprotectStatus& status)
   --- End of inner exception stack trace ---
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingBasedDataProtector.UnprotectCore(Byte[] protectedData, Boolean allowOperationsOnRevokedKeys, UnprotectStatus& status)
   at Microsoft.AspNetCore.DataProtection.KeyManagement.KeyRingBasedDataProtector.DangerousUnprotect(Byte[] protectedData, Boolean ignoreRevocationErrors, Boolean& requiresMigration, Boolean& wasRevoked)
   at Microsoft.Azure.WebJobs.Script.WebHost.DataProtectionKeyValueConverter.Unprotect(Key key) in /_/src/WebJobs.Script.WebHost/Security/KeyManagement/DataProtectionKeyValueConverter.cs:line 42
   at Microsoft.Azure.WebJobs.Script.WebHost.DataProtectionKeyValueConverter.ReadValue(Key key) in /_/src/WebJobs.Script.WebHost/Security/KeyManagement/DataProtectionKeyValueConverter.cs:line 29
   at Microsoft.Azure.WebJobs.Script.WebHost.SecretManager.ReadHostSecrets(HostSecrets hostSecrets) in /_/src/WebJobs.Script.WebHost/Security/KeyManagement/SecretManager.cs:line 741
   at Microsoft.Azure.WebJobs.Script.WebHost.SecretManager.GetHostSecretsAsync() in /_/src/WebJobs.Script.WebHost/Security/KeyManagement/SecretManager.cs:line 123.

It might be that the high volume of calls to the DP library caused this failure.</issue_description>

<agent_instructions>Apply the fix suggested by Mathew in his description.</agent_instructions>

Comments on the Issue (you are @copilot in this section)


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Co-authored-by: fabiocav <2507935+fabiocav@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix race condition in SecretManager secret caching Fix race condition in SecretManager secret caching Jan 14, 2026
Copilot AI requested a review from fabiocav January 14, 2026 21:42
@fabiocav
Copy link
Member

@copilot please also update the release_notes.md file with a description of this change (along to a reference to this PR)

Co-authored-by: fabiocav <2507935+fabiocav@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jan 14, 2026

@copilot please also update the release_notes.md file with a description of this change (along to a reference to this PR)

Updated release_notes.md with a description of the fix in commit 33dad3a. The PR number is set as #TBD and will need to be updated once the PR number is assigned.

@fabiocav fabiocav marked this pull request as ready for review January 14, 2026 22:59
@fabiocav fabiocav requested a review from a team as a code owner January 14, 2026 22:59
Copilot AI review requested due to automatic review settings January 14, 2026 22:59
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a race condition in SecretManager where multiple concurrent threads could serially acquire locks and redundantly load secrets from storage, causing excessive logging and potential DataProtection library failures. The fix implements double-check locking in both GetHostSecretsAsync and GetFunctionSecretsAsync methods.

Changes:

  • Added double-check locking pattern in GetHostSecretsAsync to verify _hostSecrets is null after lock acquisition
  • Added double-check locking pattern in GetFunctionSecretsAsync to verify function secrets aren't already cached after lock acquisition
  • Added concurrency tests to validate that secrets are loaded only once despite concurrent calls
  • Updated release notes with entry describing the fix

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/WebJobs.Script.WebHost/Security/KeyManagement/SecretManager.cs Implements double-check locking in both secret loading methods to prevent redundant loads
test/WebJobs.Script.Tests/Security/SecretManagerTests.cs Adds concurrency tests validating that loading log messages appear exactly once
release_notes.md Documents the race condition fix

fabiocav and others added 2 commits January 15, 2026 13:19
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@fabiocav
Copy link
Member

/azp run host.public

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

release_notes.md Outdated
<!-- Please add your release notes in the following format:
- My change description (#PR)
-->
- Fix race condition in SecretManager secret caching with double-check locking pattern (#TBD)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this TBD here?

@jviau
Copy link
Contributor

jviau commented Jan 23, 2026

/azp run host.public

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

<!-- Please add your release notes in the following format:
- My change description (#PR)
-->
-->
Copy link
Member

@liliankasem liliankasem Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot Update the release notes and set the issue number to 11560

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition in SecretManager secret caching

5 participants