Skip to content

Conversation

@easwars
Copy link
Contributor

@easwars easwars commented Jan 29, 2026

This PR implements the currently in-review gRFC A113: grpc/proposal#535. I've split the PR into logically separate commits to help with the review process.

Summary of changes:

  • Commit 1: simplify the implementation of groupLocalitiesByPriority
    • Change the implementation to use newly added methods in the stdlib maps and slices package to significantly simplify the implementation (and get rid of an unnecessary test)
  • Commit 2: Remove code that handles localities and endpoints of weight 0
    • Remove unnecessary checks for locality and endpoint weights of 0 in cluster_resolver. The xDS client already guarantees that these weights will never be set to 0.
  • Commit 3: add the env var GRPC_EXPERIMENTAL_PF_WEIGHTED_SHUFFLING
  • Commit 4: Weight computation changes in cluster_resolver LB policy
    • This performs the weight normalization and fixed-point arithmetic specified in A113
    • The change here is guarded by the above env var
    • Ended up duplicating the tests that verify the weight computation behavior. This will make it easier to delete the old tests when the env var is removed.
  • Commit 5: Fix a broken test in ring_hash due to the new weight computation
  • Commit 6: Weighted shuffling in pick_first
    • Contains the changes specified in A113 for the pick_first LB policy
    • Changes are guarded by the env var

RELEASE NOTES:

  • pickfirst: Add support for weighted random shuffling of endpoints, as described in gRFC A113

@easwars easwars requested review from arjan-bal and dfawley January 29, 2026 07:07
@easwars easwars added Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. Type: Feature New features or improvements in behavior labels Jan 29, 2026
@easwars easwars added this to the 1.79 Release milestone Jan 29, 2026
@codecov
Copy link

codecov bot commented Jan 29, 2026

Codecov Report

❌ Patch coverage is 92.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.17%. Comparing base (69b542a) to head (3f4487d).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
balancer/pickfirst/pickfirst.go 92.59% 1 Missing and 1 partial ⚠️
...rnal/xds/balancer/clusterresolver/configbuilder.go 91.30% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8864      +/-   ##
==========================================
+ Coverage   82.42%   83.17%   +0.75%     
==========================================
  Files         414      414              
  Lines       32750    32755       +5     
==========================================
+ Hits        26994    27245     +251     
+ Misses       4096     4090       -6     
+ Partials     1660     1420     -240     
Files with missing lines Coverage Δ
balancer/pickfirst/internal/internal.go 100.00% <ø> (ø)
internal/envconfig/envconfig.go 100.00% <ø> (+26.66%) ⬆️
balancer/pickfirst/pickfirst.go 89.28% <92.59%> (-0.34%) ⬇️
...rnal/xds/balancer/clusterresolver/configbuilder.go 92.42% <91.30%> (-1.61%) ⬇️

... and 26 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

// will change the order of endpoints but not touch the order of the
// addresses within each endpoint. - A61
if cfg.ShuffleAddressList {
endpoints = append([]resolver.Endpoint{}, endpoints...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should continue cloning the endpoints list here. Directly mutating the ResolverState can lead to data races if the caller of UpdateClientConnState reads the state concurrently. It would also be helpful to add a comment explaining this to prevent the accidental removal of the copy in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh. I feel like we need to either make resolver.State.Endpoints an immutable type (EndpointList) or guarantee it's deeply copied when resolver.State is copied.

@easwars easwars assigned arjan-bal and unassigned easwars Jan 30, 2026
Copy link
Contributor

@arjan-bal arjan-bal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with some minor comments.

// Endpoints weights are the product of normalized locality weight and
// endpoint weight, represented as a fixed-point number in uQ1.31 format.
// Locality weights are normalized as:
// P1: locality 0: 80 / (100) = 0.8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe locality 0 has weight 20 and locality 3 has weight 80. They seem to be swapped in this comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OMG. This test is so messed up and I've made a copy of it, making it two problems instead of one. I've anyways filed #8868 to cleanup these tests.

}
if diff := cmp.Diff(got, tt.wantConfig); diff != "" {
if diff := cmp.Diff(gotConfig, tt.wantConfig); diff != "" {
t.Errorf("localitiesToWeightedTarget() diff (-got +want) %v", diff)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message logs the wrong function name. Same issue a couple of lines below and in the corresponding test with weighted sorting enabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
resolverEndpoint = weight.Set(resolverEndpoint, weight.EndpointInfo{Weight: endpointWeight})
} else {
resolverEndpoint = weight.Set(resolverEndpoint, weight.EndpointInfo{Weight: locality.Weight * endpoint.Weight})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to the changes introduced in this PR, but can't this multiplication result in an overflow since EndpointInfo.Weight is also a uint32?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it definitely can overflow. But since no one has complained so far and we will get rid of this code path soonish (once we delete the env var), I'm Ok letting it be as-is.

@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Jan 30, 2026
// will change the order of endpoints but not touch the order of the
// addresses within each endpoint. - A61
if cfg.ShuffleAddressList {
endpoints = append([]resolver.Endpoint{}, endpoints...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh. I feel like we need to either make resolver.State.Endpoints an immutable type (EndpointList) or guarantee it's deeply copied when resolver.State is copied.

Comment on lines +289 to +292
// Here, and in the "else" block below, we clone the endpoints
// slice to avoid mutating the resolver state. Doing the latter
// would lead to data races if the caller is accessing the same
// slice concurrently.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is important even if it's not a concurrent update. We just shouldn't be mutating the incoming update's endpoints because that likely changes state in the parent.

// in each locality, the probability of selecting the endpoint within the
// locality is 1. Therefore, the target fractions end up as 1/3 and 2/3
// respectively.
const wantRPCs0 = float64(1) / float64(3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW you only need to float64 either the numerator or the denominator.

@easwars easwars merged commit 61d569d into grpc:master Jan 30, 2026
21 of 22 checks passed
@easwars easwars deleted the directpath_soft_affinity branch January 30, 2026 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. Type: Feature New features or improvements in behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants