DHCP lease maintenance should terminate when interface no longer exists.#1143
Conversation
3cb9c6b to
20e6ba9
Compare
20e6ba9 to
0834450
Compare
plugins/ipam/dhcp/lease.go
Outdated
|
|
||
| ctx, cancel := context.WithTimeout(ctx, linkCheckTotalTimeout) | ||
| defer cancel() | ||
| _, err := backoffRetry(ctx, linkCheckRetryMax, checkFunc) |
There was a problem hiding this comment.
the only downside here is that backoffRetry() sets a "base delay" using the resendDelay0 const which was from the DHCP RFC stuff, so, it's 4 seconds the first time. This could probably be quicker? Will happily rejig backoffRetry to take a base delay parameter.
There was a problem hiding this comment.
oops missed a major part of this... fix incoming.
|
Soooo.... yeah what I have currently pushed has some things I totally whiffed on and I'm working on correcting it, but, in an ideal world, the netlink library release v1.3.0 doesn't include the commit to detect the ErrDumpInterrupted I'm looking into working around it until that can get released, which I also requested in vishvananda/netlink#1019 |
0834450 to
0249707
Compare
plugins/ipam/dhcp/lease.go
Outdated
| return exists, err | ||
| } | ||
|
|
||
| func backoffRetryLinkExists(ctx context.Context, maxDelay time.Duration, f func() (bool, error)) (bool, error) { |
There was a problem hiding this comment.
I wound up creating another backoff method because I needed to have the retry method f func() (bool, error) return a bool and an error, so, I additionally added linkCheckDelay0
plugins/ipam/dhcp/lease.go
Outdated
| if err != nil { | ||
| log.Printf("!bang: Error checking network link '%s': %v, type: %v", l.link.Attrs().Name, err, reflect.TypeOf(err)) | ||
| var linkNotFoundErr *netlink.LinkNotFoundError | ||
| if errors.As(err, &linkNotFoundErr) { |
There was a problem hiding this comment.
This wasn't evaluating to true even though I was seeing logs as:
2025/01/21 21:36:46 Error checking network link 'net1': Link not found, type: netlink.LinkNotFoundError
I'll remove the reflect package and also the lines marked with !bang once I get it sorted.
ea7c4f0 to
55262dc
Compare
|
I'm going to change to checking for the string That would never evaluate as true, even when I could find the error as |
c0f73b9 to
78d100b
Compare
You're passing the address of an empty pointer (pointer of pointer) to This should do the job Try it in https://go.dev/play/p/GZJO1RdVR8g |
78d100b to
7758510
Compare
plugins/ipam/dhcp/lease.go
Outdated
| _, err := netlink.LinkByName(l.link.Attrs().Name) | ||
| if err != nil { | ||
| var linkNotFoundErr *netlink.LinkNotFoundError = &netlink.LinkNotFoundError{} | ||
| if errors.As(err, linkNotFoundErr) { |
There was a problem hiding this comment.
Owe you one @zeeke !!! Thanks. Appreciate that, it works a charm and much cleaner than the string check, thanks.
plugins/ipam/dhcp/lease.go
Outdated
| ctx, cancel := context.WithTimeout(ctx, linkCheckTotalTimeout) | ||
| defer cancel() | ||
|
|
||
| exists, err := backoffRetryLinkExists(ctx, linkCheckRetryMax, checkFunc) |
There was a problem hiding this comment.
linkCheckRetryMax is constant variable, defined as global scope in this file. So we may remove it from function argument. Otherwise, remove constant variable, linkCheckRetryMax and define it in function variable.
There was a problem hiding this comment.
thanks, simplified so that it isn't passed through the method param and just used directly there, good call
plugins/ipam/dhcp/lease.go
Outdated
| func backoffRetryLinkExists(ctx context.Context, maxDelay time.Duration, f func() (bool, error)) (bool, error) { | ||
| baseDelay := linkCheckDelay0 | ||
| for { | ||
| exists, err := f() |
There was a problem hiding this comment.
Currently, this function is pretty for backoffRetryLinkExists, not for common retry, hence we may not required to have f func() (bool, error)) as argument, I guess. Instead of that, having linkName in function argument makes it simple, I feel.
func backoffRetryLinkExists(ctx context.Context, maxDelay time.Duration, linkName string) (bool, error) {
checkFunc := func() (bool, error) {
_, err := netlink.LinkByName(linkName)
if err != nil {
var linkNotFoundErr *netlink.LinkNotFoundError = &netlink.LinkNotFoundError{}
if errors.As(err, linkNotFoundErr) {
return false, nil
}
return false, err
}
return true, nil
}
There was a problem hiding this comment.
Thanks Tomo, nice simplification. I did wind up breaking out a method checkLinkByName to use there, and then just check it in a for loop in a condensed version of the checkLinkExistsWithBackoff method.
plugins/ipam/dhcp/lease.go
Outdated
|
|
||
| func (l *DHCPLease) checkLinkExistsWithBackoff(ctx context.Context) (bool, error) { | ||
| checkFunc := func() (bool, error) { | ||
| _, err := netlink.LinkByName(l.link.Attrs().Name) |
There was a problem hiding this comment.
If we assume that this link is already removed, then l.link.Attrs() may be invalid. We should store interface name ahead.
There was a problem hiding this comment.
interesting, I didn't really account for this, even though it seems it was pulling up the correct name. However, just in case this does happen in other situations, I instead opted to store linkName as a property of the DHCPLease object. Thanks!
Due to oberservations that threads can grow and the dhcp daemon uses an increasing amount of memory. This situation can happen organically when using say, bridge CNI, and the bridge has been removed outside of the bridge CNI lifecycle, and an interface no longer exists on a pod. Does so on a retry loop using the `backoffRetry()` method. Signed-off-by: dougbtv <dosmith@redhat.com>
7758510 to
ce0c680
Compare
|
I could not secure the time to review this week. |
|
@squeed Could you check merge criteria and merge that if it is okey? This PR already has two maingainer approval. |
|
Fix included in accepted release 4.19.0-0.nightly-2025-04-02-065200 |
|
Fix included in accepted release 4.19.0-0.nightly-2025-04-02-170034 |
|
Fix included in accepted release 4.19.0-0.nightly-2025-04-04-023411 |
|
Fix included in accepted release 4.19.0-0.nightly-2025-04-04-170728 |
Due to oberservations that threads can grow and the dhcp daemon uses an increasing amount of memory.
This situation can happen organically when using say, bridge CNI, and the bridge has been removed outside of the bridge CNI lifecycle, and an interface no longer exists on a pod.
Also, for reproduction steps (admittedly, openshift biased), see my steps in https://gist.github.com/dougbtv/803316016c1e96a529725013402d5f0d