fix: Ignore pod patching failures during startup #397
Merged
Note: this is a copy of #392, which auto-closed when I renamed my branch in an attempt to follow conventionalcommits.org
Summary

Patch failures in `pod_webhook.Start()` abort the controller process and may result in crash looping. We've experienced this on a test cluster, where 8 pods out of ~1100 return an error on the "no op" patch attempt in `Start()`. As I believe the code within `Start()` is an optimization to eagerly create `CachedImages`, but is otherwise not required for the controller to function properly, this patch instead logs the `err` and continues on to the next pod.

Patch Failure Details
Here's a sample of the patch failure:
I've no idea why the patch logic occasionally converts from `Values: {"on-demand"}` to `Values: []string{"on-demand"}` on only a subset of pods. A first thought was the match expression objects being reordered and sorted based on the `Key`, but I was unable to replicate the error. This style of `Affinity` config exists in many other pods in the cluster.